Topic: Cardiovascular Disease Analysis and Prediction
Team Members: Zhang Haoran, Huang Menghui, Sun Qiyang, Tham Yong Hao
Tutorial Group: EE08
Professor: Chen Lihui
Cardiovascular diseases (CVDs) are the leading cause of death globally, taking an estimated 17.9 million lives each year. Early prediction and prevention of CVDs are therefore vital for improving survival rates. In this project, we analyse a cardiovascular disease dataset of 70,000 records collected at the moment of medical examination and fit them to different machine learning models. Finally, we develop an application for real-time prediction based on what we learn from this dataset.
Data Cleaning:
Data Preparation:
Exploratory Data Analysis & Visualisation:
Machine Learning:
Data Merge:
Application Development:
Remarks:
# !pip install plotly==5.3.1
# !pip install xgboost==1.3.3
# !brew install libomp
# !pip install tensorflow
# !pip install hyperopt
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from matplotlib import rcParams
import plotly.express as px
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score, accuracy_score, recall_score, classification_report, confusion_matrix
from sklearn import tree
from sklearn.svm import SVC
from sklearn.inspection import permutation_importance
from sklearn import decomposition
import xgboost as xgb
from xgboost import XGBClassifier
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe, space_eval
from sklearn.ensemble import AdaBoostClassifier
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
import pickle
import time
Source of dataset: https://www.kaggle.com/sulianova/cardiovascular-disease-dataset
Data description. There are 3 types of input features:
Objective: factual information; Examination: results of medical examination; Subjective: information given by the patient.
Features:
df = pd.read_csv('cardio_train.csv',sep=';')
df.head()
| id | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 18393 | 2 | 168 | 62.0 | 110 | 80 | 1 | 1 | 0 | 0 | 1 | 0 |
| 1 | 1 | 20228 | 1 | 156 | 85.0 | 140 | 90 | 3 | 1 | 0 | 0 | 1 | 1 |
| 2 | 2 | 18857 | 1 | 165 | 64.0 | 130 | 70 | 3 | 1 | 0 | 0 | 0 | 1 |
| 3 | 3 | 17623 | 2 | 169 | 82.0 | 150 | 100 | 1 | 1 | 0 | 0 | 1 | 1 |
| 4 | 4 | 17474 | 1 | 156 | 56.0 | 100 | 60 | 1 | 1 | 0 | 0 | 0 | 0 |
df.describe()
| id | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 70000.000000 | 70000.000000 | 70000.000000 | 70000.000000 | 70000.000000 | 70000.000000 | 70000.000000 | 70000.000000 | 70000.000000 | 70000.000000 | 70000.000000 | 70000.000000 | 70000.000000 |
| mean | 49972.419900 | 19468.865814 | 1.349571 | 164.359229 | 74.205690 | 128.817286 | 96.630414 | 1.366871 | 1.226457 | 0.088129 | 0.053771 | 0.803729 | 0.499700 |
| std | 28851.302323 | 2467.251667 | 0.476838 | 8.210126 | 14.395757 | 154.011419 | 188.472530 | 0.680250 | 0.572270 | 0.283484 | 0.225568 | 0.397179 | 0.500003 |
| min | 0.000000 | 10798.000000 | 1.000000 | 55.000000 | 10.000000 | -150.000000 | -70.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 25006.750000 | 17664.000000 | 1.000000 | 159.000000 | 65.000000 | 120.000000 | 80.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 |
| 50% | 50001.500000 | 19703.000000 | 1.000000 | 165.000000 | 72.000000 | 120.000000 | 80.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 |
| 75% | 74889.250000 | 21327.000000 | 2.000000 | 170.000000 | 82.000000 | 140.000000 | 90.000000 | 2.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 |
| max | 99999.000000 | 23713.000000 | 2.000000 | 250.000000 | 200.000000 | 16020.000000 | 11000.000000 | 3.000000 | 3.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70000 entries, 0 to 69999
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   id           70000 non-null  int64
 1   age          70000 non-null  int64
 2   gender       70000 non-null  int64
 3   height       70000 non-null  int64
 4   weight       70000 non-null  float64
 5   ap_hi        70000 non-null  int64
 6   ap_lo        70000 non-null  int64
 7   cholesterol  70000 non-null  int64
 8   gluc         70000 non-null  int64
 9   smoke        70000 non-null  int64
 10  alco         70000 non-null  int64
 11  active       70000 non-null  int64
 12  cardio       70000 non-null  int64
dtypes: float64(1), int64(12)
memory usage: 6.9 MB
df.isnull().sum()
id             0
age            0
gender         0
height         0
weight         0
ap_hi          0
ap_lo          0
cholesterol    0
gluc           0
smoke          0
alco           0
active         0
cardio         0
dtype: int64
df['cardio'].value_counts()
0    35021
1    34979
Name: cardio, dtype: int64
fig = px.histogram(df['cardio'], color=df['cardio'])
fig.update_layout(bargap=0.2)
fig.show()
This indicates that the ratio of '0' to '1' labels is 35021/34979 ≈ 1.0012, which is almost 1, so the dataset is well balanced between the two classes.
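The balance claim above can be checked with simple arithmetic (the counts are taken from the `value_counts()` output):

```python
# quick check of the class-balance ratio quoted above
zeros, ones = 35021, 34979  # counts of cardio == 0 and cardio == 1
ratio = zeros / ones
print(round(ratio, 4))  # 1.0012
```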
# Drop id column as it is not useful
df = df.drop('id', axis=1)
df
| age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18393 | 2 | 168 | 62.0 | 110 | 80 | 1 | 1 | 0 | 0 | 1 | 0 |
| 1 | 20228 | 1 | 156 | 85.0 | 140 | 90 | 3 | 1 | 0 | 0 | 1 | 1 |
| 2 | 18857 | 1 | 165 | 64.0 | 130 | 70 | 3 | 1 | 0 | 0 | 0 | 1 |
| 3 | 17623 | 2 | 169 | 82.0 | 150 | 100 | 1 | 1 | 0 | 0 | 1 | 1 |
| 4 | 17474 | 1 | 156 | 56.0 | 100 | 60 | 1 | 1 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 69995 | 19240 | 2 | 168 | 76.0 | 120 | 80 | 1 | 1 | 1 | 0 | 1 | 0 |
| 69996 | 22601 | 1 | 158 | 126.0 | 140 | 90 | 2 | 2 | 0 | 0 | 1 | 1 |
| 69997 | 19066 | 2 | 183 | 105.0 | 180 | 90 | 3 | 1 | 0 | 1 | 0 | 1 |
| 69998 | 22431 | 1 | 163 | 72.0 | 135 | 80 | 1 | 2 | 0 | 0 | 0 | 1 |
| 69999 | 20540 | 1 | 170 | 72.0 | 120 | 80 | 2 | 1 | 0 | 0 | 1 | 0 |
70000 rows × 12 columns
# Check for duplicated dataset
df[df.duplicated(keep=False)].sort_values(by=['gender', 'height', 'weight'], ascending= False).head()
| age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10494 | 16937 | 2 | 170 | 70.0 | 120 | 80 | 1 | 1 | 0 | 0 | 0 | 0 |
| 44653 | 16937 | 2 | 170 | 70.0 | 120 | 80 | 1 | 1 | 0 | 0 | 0 | 0 |
| 1142 | 17493 | 2 | 169 | 74.0 | 120 | 80 | 1 | 1 | 0 | 0 | 1 | 1 |
| 50432 | 17493 | 2 | 169 | 74.0 | 120 | 80 | 1 | 1 | 0 | 0 | 1 | 1 |
| 32683 | 17535 | 2 | 165 | 65.0 | 120 | 80 | 1 | 1 | 0 | 0 | 1 | 0 |
# Remove duplicated rows as they don't contribute to prediction but only increase the size of the training data
df = df.drop_duplicates(keep = 'first')
df
| age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18393 | 2 | 168 | 62.0 | 110 | 80 | 1 | 1 | 0 | 0 | 1 | 0 |
| 1 | 20228 | 1 | 156 | 85.0 | 140 | 90 | 3 | 1 | 0 | 0 | 1 | 1 |
| 2 | 18857 | 1 | 165 | 64.0 | 130 | 70 | 3 | 1 | 0 | 0 | 0 | 1 |
| 3 | 17623 | 2 | 169 | 82.0 | 150 | 100 | 1 | 1 | 0 | 0 | 1 | 1 |
| 4 | 17474 | 1 | 156 | 56.0 | 100 | 60 | 1 | 1 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 69995 | 19240 | 2 | 168 | 76.0 | 120 | 80 | 1 | 1 | 1 | 0 | 1 | 0 |
| 69996 | 22601 | 1 | 158 | 126.0 | 140 | 90 | 2 | 2 | 0 | 0 | 1 | 1 |
| 69997 | 19066 | 2 | 183 | 105.0 | 180 | 90 | 3 | 1 | 0 | 1 | 0 | 1 |
| 69998 | 22431 | 1 | 163 | 72.0 | 135 | 80 | 1 | 2 | 0 | 0 | 0 | 1 |
| 69999 | 20540 | 1 | 170 | 72.0 | 120 | 80 | 2 | 1 | 0 | 0 | 1 | 0 |
69976 rows × 12 columns
Number of datapoints removed is 70000 - 69976 = 24
# Converting age from days to years
df['year'] = (df['age']/365).round().astype('int')
df
| age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio | year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18393 | 2 | 168 | 62.0 | 110 | 80 | 1 | 1 | 0 | 0 | 1 | 0 | 50 |
| 1 | 20228 | 1 | 156 | 85.0 | 140 | 90 | 3 | 1 | 0 | 0 | 1 | 1 | 55 |
| 2 | 18857 | 1 | 165 | 64.0 | 130 | 70 | 3 | 1 | 0 | 0 | 0 | 1 | 52 |
| 3 | 17623 | 2 | 169 | 82.0 | 150 | 100 | 1 | 1 | 0 | 0 | 1 | 1 | 48 |
| 4 | 17474 | 1 | 156 | 56.0 | 100 | 60 | 1 | 1 | 0 | 0 | 0 | 0 | 48 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 69995 | 19240 | 2 | 168 | 76.0 | 120 | 80 | 1 | 1 | 1 | 0 | 1 | 0 | 53 |
| 69996 | 22601 | 1 | 158 | 126.0 | 140 | 90 | 2 | 2 | 0 | 0 | 1 | 1 | 62 |
| 69997 | 19066 | 2 | 183 | 105.0 | 180 | 90 | 3 | 1 | 0 | 1 | 0 | 1 | 52 |
| 69998 | 22431 | 1 | 163 | 72.0 | 135 | 80 | 1 | 2 | 0 | 0 | 0 | 1 | 61 |
| 69999 | 20540 | 1 | 170 | 72.0 | 120 | 80 | 2 | 1 | 0 | 0 | 1 | 0 | 56 |
69976 rows × 13 columns
# Reorder columns
df = df.drop('age', axis=1)
df = df[ ['year'] + [ col for col in df.columns if col != 'year' ] ]
# ap_hi and ap_lo cannot be negative or zero; first mark these error values as np.nan, then decide how to handle them
# (plain .loc indexing avoids pandas' SettingWithCopyWarning from chained assignment)
df.loc[df['ap_hi'] <= 0, 'ap_hi'] = np.nan
df.loc[df['ap_lo'] <= 0, 'ap_lo'] = np.nan
df
| year | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50 | 2 | 168 | 62.0 | 110.0 | 80.0 | 1 | 1 | 0 | 0 | 1 | 0 |
| 1 | 55 | 1 | 156 | 85.0 | 140.0 | 90.0 | 3 | 1 | 0 | 0 | 1 | 1 |
| 2 | 52 | 1 | 165 | 64.0 | 130.0 | 70.0 | 3 | 1 | 0 | 0 | 0 | 1 |
| 3 | 48 | 2 | 169 | 82.0 | 150.0 | 100.0 | 1 | 1 | 0 | 0 | 1 | 1 |
| 4 | 48 | 1 | 156 | 56.0 | 100.0 | 60.0 | 1 | 1 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 69995 | 53 | 2 | 168 | 76.0 | 120.0 | 80.0 | 1 | 1 | 1 | 0 | 1 | 0 |
| 69996 | 62 | 1 | 158 | 126.0 | 140.0 | 90.0 | 2 | 2 | 0 | 0 | 1 | 1 |
| 69997 | 52 | 2 | 183 | 105.0 | 180.0 | 90.0 | 3 | 1 | 0 | 1 | 0 | 1 |
| 69998 | 61 | 1 | 163 | 72.0 | 135.0 | 80.0 | 1 | 2 | 0 | 0 | 0 | 1 |
| 69999 | 56 | 1 | 170 | 72.0 | 120.0 | 80.0 | 2 | 1 | 0 | 0 | 1 | 0 |
69976 rows × 12 columns
# ap_lo cannot be greater than or equal to ap_hi; mark both readings of such rows as np.nan
invalid_bp = df['ap_lo'] >= df['ap_hi']
df.loc[invalid_bp, ['ap_hi', 'ap_lo']] = np.nan
# Now all the error values are marked as NaN; check the number of null values
df.isnull().sum()
year              0
gender            0
height            0
weight            0
ap_hi          1236
ap_lo          1251
cholesterol       0
gluc              0
smoke             0
alco              0
active            0
cardio            0
dtype: int64
#Since thousands of values are null, dropping them all would waste data, so we replace them with the median
#We choose the median because it is robust to outliers
ap_hi = df.loc[:, "ap_hi"].values.reshape(-1, 1)
ap_hi_imputer = SimpleImputer(strategy='median')  # the default strategy is 'mean', so 'median' must be set explicitly
df.loc[:, "ap_hi"] = ap_hi_imputer.fit_transform(ap_hi)
ap_lo = df.loc[:, "ap_lo"].values.reshape(-1, 1)
ap_lo_imputer = SimpleImputer(strategy='median')
df.loc[:, "ap_lo"] = ap_lo_imputer.fit_transform(ap_lo)
df.isnull().sum()
year           0
gender         0
height         0
weight         0
ap_hi          0
ap_lo          0
cholesterol    0
gluc           0
smoke          0
alco           0
active         0
cardio         0
dtype: int64
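As a sanity check, here is a minimal sketch (on assumed toy values) of how `SimpleImputer(strategy='median')` fills a missing reading; note the default strategy is `'mean'`, so `'median'` has to be requested explicitly:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# toy column with one missing blood-pressure reading (assumed values)
toy = np.array([[100.0], [120.0], [np.nan], [140.0]])
imputer = SimpleImputer(strategy='median')  # default is 'mean'
filled = imputer.fit_transform(toy)
print(filled.ravel())  # the NaN is replaced by 120.0, the median of the observed values
```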
# Plotting weight against height to have a big picture of the data distribution and outliers
fig = px.scatter(df, x="weight", y="height")
fig.show()
# Plotting ap_lo against ap_hi to have a big picture of the data distribution and outliers
fig = px.scatter(df, x="ap_lo", y="ap_hi")
fig.show()
# Defining a function to keep track of the number of outliers (1.5 * IQR rule)
def outliers(df_out, columns, drop=False):
    for each_feature in columns:
        feature_data = df_out[each_feature]
        Q1 = np.percentile(feature_data, 25.)  # 25th percentile of the given feature
        Q3 = np.percentile(feature_data, 75.)  # 75th percentile of the given feature
        IQR = Q3 - Q1  # interquartile range
        outlier_step = IQR * 1.5  # the standard 1.5 * IQR fence
        outlier_idx = feature_data[~((feature_data >= Q1 - outlier_step) & (feature_data <= Q3 + outlier_step))].index.tolist()
        if drop:
            df_out.drop(outlier_idx, inplace=True)
            print('Filtering through feature {}, No of Outliers removed is {}'.format(each_feature, len(outlier_idx)))
            continue
        print('For the feature {}, No of Outliers is {}'.format(each_feature, len(outlier_idx)))
outliers(df,['height', 'weight', 'ap_hi', 'ap_lo'])
For the feature height, No of Outliers is 519
For the feature weight, No of Outliers is 1819
For the feature ap_hi, No of Outliers is 1058
For the feature ap_lo, No of Outliers is 3548
We can see that there are thousands of outliers.
As the number of outliers for height and weight is still relatively small compared to ap_hi and ap_lo, we will only apply this method to height and weight, to avoid losing too much data.
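For reference, the 1.5 × IQR rule used in `outliers()` can be illustrated on a tiny assumed sample:

```python
import numpy as np

# toy height sample (cm); 250 lies far outside the 1.5*IQR fences
x = np.array([150, 160, 165, 170, 175, 250])
Q1, Q3 = np.percentile(x, [25, 75])
step = 1.5 * (Q3 - Q1)
mask = (x < Q1 - step) | (x > Q3 + step)
print(x[mask])  # [250]
```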
# The logarithm corrects the skewness of the data and scales it down, letting us remove the appropriate outliers while reducing data loss
trial_df = df.copy()
trial_df["ap_hi_logged"] = np.log(trial_df["ap_hi"])
print('skewness of ap_hi before log: ', trial_df['ap_hi'].skew())
print('skewness of ap_hi after log: ', trial_df['ap_hi_logged'].skew())
skewness of ap_hi before log:  85.58924279640866
skewness of ap_hi after log:  5.734343654420562
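The same effect can be seen on an assumed toy series: for values spaced exactly lognormally, the log-transform removes the right skew entirely:

```python
import numpy as np
import pandas as pd

# values spaced as exp(0..4): heavily right-skewed before the log, symmetric after
s = pd.Series(np.exp([0.0, 1.0, 2.0, 3.0, 4.0]))
print(s.skew() > np.log(s).skew())  # True: the log-transform reduces the skewness
```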
outlier_free_df = df.copy()
outlier_free_df[['height', 'weight']] = np.log(outlier_free_df[['height', 'weight']])
outliers(outlier_free_df, ['height', 'weight'], drop=True)
Filtering through feature height, No of Outliers removed is 484
Filtering through feature weight, No of Outliers removed is 1115
# number of removed datapoints
print('Number of datapoints removed is', len(df)-len(outlier_free_df))
Number of datapoints removed is 1599
# getting back the original height and weight values
outlier_free_df[['height', 'weight']] = df[['height', 'weight']]
# Reassigning outlier_free_df to df
df=outlier_free_df
We will now clean up outliers for ap_hi and ap_lo with data from our research
# A study published in 1995 recorded a maximum blood pressure of 370/360 mm Hg, so we cap ap_hi at 370
# Diastolic pressure (ap_lo) cannot be 0, since some pressure is needed for blood flow
df.drop(df[df['ap_hi']>370].index, inplace=True)
print('Number of points remaining is', len(df))
print('Number of points removed is', 68377-len(df))
Number of points remaining is 68340
Number of points removed is 37
df[['height', 'weight', 'ap_hi', 'ap_lo']].describe()
| height | weight | ap_hi | ap_lo | |
|---|---|---|---|---|
| count | 68340.000000 | 68340.000000 | 68340.000000 | 68340.000000 |
| mean | 164.456292 | 73.795437 | 126.632309 | 81.247261 |
| std | 7.591273 | 12.996633 | 16.471926 | 9.417500 |
| min | 144.000000 | 46.000000 | 12.000000 | 1.000000 |
| 25% | 159.000000 | 65.000000 | 120.000000 | 80.000000 |
| 50% | 165.000000 | 72.000000 | 120.000000 | 80.000000 |
| 75% | 170.000000 | 82.000000 | 140.000000 | 90.000000 |
| max | 187.000000 | 116.000000 | 309.000000 | 182.000000 |
#Create 'bmi' column and move it to the proper position
#BMI=w/(h^2), where w is weight(kg), h is height(m)
df['BMI']= df['weight']/(df['height']*df['height']/10000)
df.insert(8,'bmi',df['BMI'])
del df['BMI']
#Create 'map' column and move it to the proper position
#mean arterial pressure (MAP) = (ap_hi + 2 *(ap_lo))/3
df['MAP']=(df['ap_hi']+2*df['ap_lo'])/3
df.insert(6,'map',df['MAP'])
del df['MAP']
#Create 'pp' column and move it to the proper position
#Pulse pressure (PP) = ap_hi-ap_lo
df['PP']= df['ap_hi']-df['ap_lo']
df.insert(7,'pp',df['PP'])
del df['PP']
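As a worked example, the three derived features for the first row of the table above (height 168 cm, weight 62 kg, blood pressure 110/80) come out as:

```python
# derived features for one reading, matching the formulas above
height_cm, weight_kg = 168, 62.0
ap_hi, ap_lo = 110, 80

bmi = weight_kg / (height_cm / 100) ** 2  # BMI = w / h^2, with h in metres
map_ = (ap_hi + 2 * ap_lo) / 3            # mean arterial pressure
pp = ap_hi - ap_lo                        # pulse pressure

print(round(bmi, 2), map_, pp)  # 21.97 90.0 30
```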
#Make gender label in 0 and 1 instead of 1 and 2
df['gender'] = df['gender'] - 1
# Categorical Variables
dfCatData = pd.DataFrame(df[['gender', 'cholesterol', 'gluc', 'smoke','alco','cardio','active']])
dfCatData.head()
| gender | cholesterol | gluc | smoke | alco | cardio | active | |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 |
| 1 | 0 | 3 | 1 | 0 | 0 | 1 | 1 |
| 2 | 0 | 3 | 1 | 0 | 0 | 1 | 0 |
| 3 | 1 | 1 | 1 | 0 | 0 | 1 | 1 |
| 4 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
dfCatData['gender'] = dfCatData['gender'].astype('category')
dfCatData['cholesterol'] = dfCatData['cholesterol'].astype('category')
dfCatData['gluc'] = dfCatData['gluc'].astype('category')
dfCatData['smoke'] = dfCatData['smoke'].astype('category')
dfCatData['alco'] = dfCatData['alco'].astype('category')
dfCatData['cardio'] = dfCatData['cardio'].astype('category')
dfCatData['active'] = dfCatData['active'].astype('category')
# Numerical Variables
dfnum = pd.DataFrame(df[['year','height','weight','ap_hi','ap_lo','pp','map','bmi']])
f, axes = plt.subplots(8, 3, figsize=(30, 35))
colors = ["red", "green", "blue", "magenta", "cyan", "yellow", "purple", "orange"]
count = 0
for var in dfnum:
sns.boxplot(data=dfnum[var], orient = "h", color = colors[count], ax = axes[count,0])
sns.histplot(data=dfnum[var], color = colors[count], ax = axes[count,1])
sns.violinplot(data=dfnum[var], color = colors[count], ax = axes[count,2])
count += 1
fig = px.histogram(df, x="year", pattern_shape="cardio")
fig.update_layout(bargap=0.2)
fig.show()
#To make the dataset fit our objective better, we drop the small isolated group at year == 30 and target people aged 39-65.
df = df[df.year>=35]
fig = px.histogram(df, x="year", pattern_shape="cardio")
fig.update_layout(bargap=0.2)
fig.show()
trenddf = pd.DataFrame(df.groupby(['year'])['cardio'].sum()/df.groupby(['year']).size(), columns=['risk']).reset_index()
fig = px.line(trenddf, x="year", y="risk",title='The risk of cardiovascular disease varies with age')
fig.show()
We can see that the risk of getting cardiovascular disease increases with age (year).
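The `trenddf` construction above (`sum()` divided by `size()` per group) is equivalent to a per-group mean of the 0/1 target; a minimal sketch on an assumed toy frame:

```python
import pandas as pd

# sum()/size() per group is just the group mean of a 0/1 target
toy = pd.DataFrame({'year': [50, 50, 60, 60], 'cardio': [0, 1, 1, 1]})
risk = toy.groupby('year')['cardio'].sum() / toy.groupby('year').size()
print(risk.loc[50], risk.loc[60])  # 0.5 1.0
assert (risk == toy.groupby('year')['cardio'].mean()).all()
```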
fig = px.box(df, x="cardio", y="year",color = 'cardio')
fig.update_traces(quartilemethod="exclusive") # or "inclusive", or "linear" by default
fig.show()
fig = px.box(df, x="cardio", y="height",color = 'cardio')
fig.update_traces(quartilemethod="exclusive") # or "inclusive", or "linear" by default
fig.show()
fig = px.box(df, x="cardio", y="weight",color = 'cardio')
fig.update_traces(quartilemethod="exclusive") # or "inclusive", or "linear" by default
fig.show()
fig = px.box(df, x="cardio", y="ap_hi",color = 'cardio')
fig.update_traces(quartilemethod="exclusive") # or "inclusive", or "linear" by default
fig.show()
fig = px.box(df, x="cardio", y="ap_lo",color = 'cardio')
fig.update_traces(quartilemethod="exclusive") # or "inclusive", or "linear" by default
fig.show()
fig = px.box(df, x="cardio", y="pp",color = 'cardio')
fig.update_traces(quartilemethod="exclusive") # or "inclusive", or "linear" by default
fig.show()
fig = px.box(df, x="cardio", y="map",color = 'cardio')
fig.update_traces(quartilemethod="exclusive") # or "inclusive", or "linear" by default
fig.show()
fig = px.box(df, x="cardio", y="bmi",color = 'cardio')
fig.update_traces(quartilemethod="exclusive") # or "inclusive", or "linear" by default
fig.show()
df_uniques = pd.melt(frame=df, value_vars=['gender','cholesterol',
'gluc', 'smoke', 'alco',
'active', 'cardio'])
df_uniques = pd.DataFrame(df_uniques.groupby(['variable',
'value'])['value'].count()) \
.sort_index(level=[0, 1]) \
.rename(columns={'value': 'count'}) \
.reset_index()
sns.catplot(x='variable', y='count', hue='value',
            data=df_uniques, kind='bar', height=12);
trenddf = pd.DataFrame(df.groupby(['gender','year'])['cardio'].sum()/df.groupby(['gender','year']).size(), columns=['risk']).reset_index()
trenddf['gender'] = trenddf['gender'].map({0: 'female', 1: 'male'})
fig = px.line(trenddf, x="year", y="risk", color='gender', title='The risk of cardiovascular disease varies by gender with age')
fig.show()
Gender - Men are more likely than women to develop cardiovascular disease at an earlier age. Women are slightly more likely than men to develop cardiovascular disease as they get older.
fig = px.histogram(dfCatData, x="gender", color="gender", pattern_shape="cardio")
fig.update_layout(bargap=0.2)
fig.show()
trenddf = pd.DataFrame(df.groupby(['cholesterol','year'])['cardio'].sum()/df.groupby(['cholesterol','year']).size(), columns=['risk']).reset_index()
trenddf['cholesterol'] = trenddf['cholesterol'].map({ 1: 'normal', 2: 'above normal', 3: 'well above normal'})
fig = px.line(trenddf, x="year", y="risk",color='cholesterol',title='The risk of cardiovascular disease varies by cholesterol with age')
fig.show()
In general higher cholesterol levels lead to higher probability of presence of cardiovascular disease.
fig = px.histogram(dfCatData, x="cholesterol", color="cholesterol", pattern_shape="cardio")
fig.update_layout(bargap=0.2)
fig.show()
trenddf = pd.DataFrame(df.groupby(['gluc','year'])['cardio'].sum()/df.groupby(['gluc','year']).size(), columns=['risk']).reset_index()
trenddf['gluc'] = trenddf['gluc'].map({ 1: 'normal', 2: 'above normal', 3: 'well above normal'})
fig = px.line(trenddf, x="year", y="risk",color='gluc',title='The risk of cardiovascular disease varies by gluc with age')
fig.show()
In general, higher glucose levels lead to higher probability of presence of cardiovascular disease.
fig = px.histogram(dfCatData, x="gluc", color="gluc", pattern_shape="cardio")
fig.update_layout(bargap=0.2)
fig.show()
trenddf = pd.DataFrame(df.groupby(['smoke','year'])['cardio'].sum()/df.groupby(['smoke','year']).size(), columns=['risk']).reset_index()
trenddf['smoke'] = trenddf['smoke'].map({ 0: 'nonsmoker', 1: 'smoker'})
fig = px.line(trenddf, x="year", y="risk",color='smoke',title='The risk of cardiovascular disease varies by smoking with age')
fig.show()
It can be seen that there is no significant difference between non-smokers and smokers in the presence of cardiovascular disease.
fig = px.histogram(dfCatData, x="smoke", color="smoke", pattern_shape="cardio")
fig.update_layout(bargap=0.2)
fig.show()
We found that the smoke variable is very unbalanced, so we tried to balance it through undersampling.
dfCatData['smoke'].value_counts()
0    62334
1     6006
Name: smoke, dtype: int64
# Undersampling people with 'smoke' == 0 to have a more reliable analysis
num_smoke = int(len(df[df['smoke'] == 1]))
non_smoke_indices = df[df.smoke == 0].index
random_indices = np.random.choice(non_smoke_indices,num_smoke, replace=False)
smoke_indices = df[df.smoke == 1].index
under_sample_indices = np.concatenate([smoke_indices,random_indices])
under_sample = df.loc[under_sample_indices]
%matplotlib inline
sns.countplot(x='smoke', data=under_sample)
<AxesSubplot:xlabel='smoke', ylabel='count'>
trenddf = pd.DataFrame(under_sample.groupby(['smoke','year'])['cardio'].sum()/under_sample.groupby(['smoke','year']).size(), columns=['risk']).reset_index()
trenddf['smoke'] = trenddf['smoke'].map({ 0: 'nonsmoker', 1: 'smoker'})
fig = px.line(trenddf, x="year", y="risk",color='smoke',title='The risk of cardiovascular disease varies by smoking with age')
fig.show()
We found that undersampling also does not help improve the reliability of this statistic, so we decided to drop the smoke feature.
df = df.drop('smoke', axis=1)
trenddf = pd.DataFrame(df.groupby(['alco','year'])['cardio'].sum()/df.groupby(['alco','year']).size(), columns=['risk']).reset_index()
trenddf['alco'] = trenddf['alco'].map({ 0: 'non-drinker', 1: 'drinker'})
fig = px.line(trenddf, x="year", y="risk",color='alco',title='The risk of cardiovascular disease varies by alco with age')
fig.show()
It can be seen that there is no significant difference between non-drinkers and drinkers in the presence of cardiovascular disease.
fig = px.histogram(dfCatData, x="alco", color="alco", pattern_shape="cardio")
fig.update_layout(bargap=0.2)
fig.show()
dfCatData['alco'].value_counts()
0    64680
1     3660
Name: alco, dtype: int64
# Undersampling people with 'alco' == 0 to have a more reliable analysis
num_alco = int(len(df[df['alco'] == 1]))
non_alco_indices = df[df.alco == 0].index
random_indices = np.random.choice(non_alco_indices,num_alco, replace=False)
alco_indices = df[df.alco == 1].index
under_sample_indices = np.concatenate([alco_indices,random_indices])
under_sample = df.loc[under_sample_indices]
%matplotlib inline
sns.countplot(x='alco', data=under_sample)
<AxesSubplot:xlabel='alco', ylabel='count'>
trenddf = pd.DataFrame(under_sample.groupby(['alco','year'])['cardio'].sum()/under_sample.groupby(['alco','year']).size(), columns=['risk']).reset_index()
trenddf['alco'] = trenddf['alco'].map({ 0: 'non-drinker', 1: 'drinker'})
fig = px.line(trenddf, x="year", y="risk",color='alco',title='The risk of cardiovascular disease varies by alco with age')
fig.show()
We found that undersampling made drinkers appear slightly more at risk of cardiovascular disease than non-drinkers, but the relationship is still very weak. We do not consider this statistic reliable, so we decided to drop the alco feature.
df = df.drop('alco', axis=1)
trenddf = pd.DataFrame(df.groupby(['active','year'])['cardio'].sum()/df.groupby(['active','year']).size(), columns=['risk']).reset_index()
trenddf['active'] = trenddf['active'].map({ 0: 'inactive', 1: 'active'})
fig = px.line(trenddf, x="year", y="risk",color='active',title='The risk of cardiovascular disease varies by active with age')
fig.show()
We observed that, as a general trend, inactive people are more prone to cardiovascular disease than active people.
fig = px.histogram(dfCatData, x="active", color="active", pattern_shape="cardio")
fig.update_layout(bargap=0.2)
fig.show()
heatmap_df = pd.concat([dfnum, df['cardio']], axis=1)
corr_matrix = heatmap_df.corr()
plt.subplots(figsize = (12, 10))
sns.heatmap(corr_matrix, annot = True, fmt = ".2f")
plt.show()
We can see that the variables most highly correlated with cardio are ap_hi, map, ap_lo and pp.
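A minimal sketch (on an assumed toy frame) of how such a ranking can be read off a correlation matrix:

```python
import pandas as pd

# ranking features by absolute correlation with the target column
toy = pd.DataFrame({'ap_hi': [110, 140, 130, 150],
                    'height': [168, 156, 165, 169],
                    'cardio': [0, 1, 1, 1]})
ranking = toy.corr()['cardio'].drop('cardio').abs().sort_values(ascending=False)
print(ranking.index[0])  # ap_hi: the feature most correlated with cardio
```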
Principal component analysis (PCA) is an unsupervised technique for reducing the dimensionality of such datasets, increasing interpretability but at the same time minimizing information loss. It tries to preserve the essential parts that have more variation of the data and remove the non-essential parts with fewer variation.

target_value = df['cardio']
cleaned_data_for_pca = df.drop(['cardio'], axis=1)
scaled_data = StandardScaler().fit_transform(cleaned_data_for_pca) # Scaling is required: PCA is sensitive to the scale of the features
pca = decomposition.PCA()
pca.n_components = 2
pca_data = pca.fit_transform(scaled_data)
pca_data = np.vstack((pca_data.T, target_value)).T # concatenate with target values
pca_df = pd.DataFrame(data = pca_data, columns = ('principal component 1', 'principal component 2', 'label'))
#pca_df['label']=pca_df['label'].map({0.0: 'No Disease', 1: 'Has Disease'})
pca_df
| principal component 1 | principal component 2 | label | |
|---|---|---|---|
| 0 | -2.014826 | 1.506569 | 0.0 |
| 1 | 2.428809 | -1.877420 | 1.0 |
| 2 | -0.326738 | -0.834095 | 1.0 |
| 3 | 2.621867 | 1.968769 | 1.0 |
| 4 | -3.853166 | -1.121575 | 0.0 |
| ... | ... | ... | ... |
| 68331 | 1.887469 | -0.134695 | 1.0 |
| 68332 | -0.618706 | 1.292674 | 0.0 |
| 68333 | 5.728409 | 2.324621 | 1.0 |
| 68334 | 0.736180 | -0.850707 | 1.0 |
| 68335 | -0.563241 | -0.136935 | 0.0 |
68336 rows × 3 columns
cleaned_data_for_pca.columns
Index(['year', 'gender', 'height', 'weight', 'ap_hi', 'ap_lo', 'map', 'pp',
'cholesterol', 'gluc', 'bmi', 'active'],
dtype='object')
# Shows the loadings (weights) of each feature in each principal component (corresponding to the column names above)
pca.components_
array([[ 0.15861127, 0.04033664, 0.01409833, 0.272979 , 0.49092817,
0.41889317, 0.48728598, 0.3556982 , 0.17883624, 0.11354017,
0.27048933, -0.00317441],
[-0.15799988, 0.58871976, 0.62290437, 0.0491057 , 0.07576417,
0.094145 , 0.0921477 , 0.03103695, -0.26812325, -0.24333521,
-0.28782849, 0.00430624]])
print("Principal Component 1 Variables with highest explained variance: ")
pc1 = pca.components_[0]
top4 = np.argsort(pc1)[::-1][:4]  # indices of the four largest loadings in PC1
for rank, idx in enumerate(top4, start=1):
    print(rank, cleaned_data_for_pca.columns[idx])
Principal Component 1 Variables with highest explained variance: 
1 ap_hi
2 map
3 ap_lo
4 pp
print('Explained variation per principal component: {}'.format(pca.explained_variance_ratio_))
Explained variation per principal component: [0.29795739 0.14095567]
Principal component 1 holds 29.80% of the information while principal component 2 holds 14.10%. Projecting the 12-dimensional data down to 2 dimensions therefore loses about 56.1% of the information.
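The information-loss figure follows directly from the explained-variance ratios reported above:

```python
# explained-variance ratios from pca.explained_variance_ratio_
evr = [0.29795739, 0.14095567]
loss = 1 - sum(evr)  # variance not captured by the two components
print(round(loss * 100, 1))  # 56.1
```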
fig = px.scatter(pca_df, x='principal component 1', y='principal component 2', color=pca_df['label'])
fig.show()
From the PCA plot, the variation between those with and without cardiovascular disease is quite small, but we can at least observe a slight shift in position between the two classes and identify the features with higher importance (larger loadings) for our classification objective.
Source of dataset: https://www.kaggle.com/mathchi/diabetes-data-set
data = pd.read_csv('diabetes.csv')
data.head()
| Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
# First, the 0 values are replaced with null values
data[['Glucose','BloodPressure','SkinThickness','Insulin','BMI']] = data[['Glucose','BloodPressure','SkinThickness','Insulin','BMI']].replace(0,np.NaN)
data.isnull().sum()
Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64
# The missing values will be filled with the median values of each variable.
def median_target(var):
temp = data[data[var].notnull()]
temp = temp[[var, 'Outcome']].groupby(['Outcome'])[[var]].median().reset_index()
return temp
# Incomplete observations are filled with the median of people who are not sick (Outcome == 0) or the median of people who are sick (Outcome == 1), respectively.
columns = data.columns
columns = columns.drop("Outcome")
for i in columns:
median_target(i)
data.loc[(data['Outcome'] == 0 ) & (data[i].isnull()), i] = median_target(i)[i][0]
data.loc[(data['Outcome'] == 1 ) & (data[i].isnull()), i] = median_target(i)[i][1]
data.isnull().sum()
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
Now, all the blanks have been filled.
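For reference, the same class-conditional median imputation can be written more compactly with `groupby`/`transform`; a minimal sketch on toy data (an equivalent idiom, not the code used above):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Glucose': [148, np.nan, 183, 89, np.nan, 137],
    'Outcome': [1, 0, 1, 0, 1, 1],
})

# Fill each missing value with the median of its own Outcome class
toy['Glucose'] = toy.groupby('Outcome')['Glucose'] \
                    .transform(lambda s: s.fillna(s.median()))
print(toy)
```

`transform` returns a column aligned with the original index, so each NaN is replaced by the median of the rows sharing its Outcome value.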
data.head(2)
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148.0 | 72.0 | 35.0 | 169.5 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85.0 | 66.0 | 29.0 | 102.5 | 26.6 | 0.351 | 31 | 0 |
df.head(2)
| | year | gender | height | weight | ap_hi | ap_lo | map | pp | cholesterol | gluc | bmi | active | cardio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50 | 1 | 168 | 62.0 | 110.0 | 80.0 | 90.000000 | 30.0 | 1 | 1 | 21.967120 | 1 | 0 |
| 1 | 55 | 0 | 156 | 85.0 | 140.0 | 90.0 | 106.666667 | 50.0 | 3 | 1 | 34.927679 | 1 | 1 |
By comparing the features and their value types across the two datasets, we decided to use 'Age', 'BloodPressure' and 'BMI' from the diabetes dataset, which correspond to 'year', 'map' and 'bmi' in the cardio dataset.
corr_matrix = data.corr()
plt.subplots(figsize = (12, 10))
sns.heatmap(corr_matrix, annot = True, fmt = ".2f")
plt.show()
Correlation with Outcome (Diabetes Positive):
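To read the correlations with the target directly rather than off the heatmap, the `Outcome` column of the correlation matrix can be sorted; a minimal sketch on toy data (the real `data` frame works the same way):

```python
import pandas as pd

# Toy stand-in for the diabetes data
toy = pd.DataFrame({
    'Glucose': [148, 85, 183, 89, 137],
    'bmi':     [33.6, 26.6, 23.3, 28.1, 43.1],
    'Outcome': [1, 0, 1, 0, 1],
})

# Rank features by absolute correlation with the outcome
outcome_corr = toy.corr()['Outcome'].drop('Outcome')
print(outcome_corr.abs().sort_values(ascending=False))
```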
# Rename the features to be consistent with those of the cardio dataset (important for model training)
data.rename(columns={'BloodPressure':'map'},inplace=True)
data.rename(columns={'BMI':'bmi'},inplace=True)
data.rename(columns={'Age':'year'},inplace=True)
data.columns
Index(['Pregnancies', 'Glucose', 'map', 'SkinThickness', 'Insulin', 'bmi',
'DiabetesPedigreeFunction', 'year', 'Outcome'],
dtype='object')
data.describe()
| | Pregnancies | Glucose | map | SkinThickness | Insulin | bmi | DiabetesPedigreeFunction | year | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 |
| mean | 3.845052 | 121.677083 | 72.389323 | 29.089844 | 141.753906 | 32.434635 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 30.464161 | 12.106039 | 8.890820 | 89.100847 | 6.880498 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.000000 | 44.000000 | 24.000000 | 7.000000 | 14.000000 | 18.200000 | 0.078000 | 21.000000 | 0.000000 |
| 25% | 1.000000 | 99.750000 | 64.000000 | 25.000000 | 102.500000 | 27.500000 | 0.243750 | 24.000000 | 0.000000 |
| 50% | 3.000000 | 117.000000 | 72.000000 | 28.000000 | 102.500000 | 32.050000 | 0.372500 | 29.000000 | 0.000000 |
| 75% | 6.000000 | 140.250000 | 80.000000 | 32.000000 | 169.500000 | 36.600000 | 0.626250 | 41.000000 | 1.000000 |
| max | 17.000000 | 199.000000 | 122.000000 | 99.000000 | 846.000000 | 67.100000 | 2.420000 | 81.000000 | 1.000000 |
df.describe()
| | year | gender | height | weight | ap_hi | ap_lo | map | pp | cholesterol | gluc | bmi | active | cardio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 68336.000000 | 68336.000000 | 68336.000000 | 68336.000000 | 68336.000000 | 68336.000000 | 68336.000000 | 68336.000000 | 68336.000000 | 68336.000000 | 68336.000000 | 68336.000000 | 68336.000000 |
| mean | 53.351030 | 0.349962 | 164.456129 | 73.795820 | 126.633136 | 81.247773 | 96.376227 | 45.385363 | 1.364654 | 1.224552 | 27.329075 | 0.803852 | 0.498361 |
| std | 6.755641 | 0.476961 | 7.591233 | 12.996457 | 16.471932 | 9.417324 | 10.928723 | 11.643811 | 0.679032 | 0.570477 | 4.800928 | 0.397085 | 0.500001 |
| min | 39.000000 | 0.000000 | 144.000000 | 46.000000 | 12.000000 | 1.000000 | 12.000000 | -69.272565 | 1.000000 | 1.000000 | 14.609204 | 0.000000 | 0.000000 |
| 25% | 48.000000 | 0.000000 | 159.000000 | 65.000000 | 120.000000 | 80.000000 | 93.333333 | 40.000000 | 1.000000 | 1.000000 | 23.875115 | 1.000000 | 0.000000 |
| 50% | 54.000000 | 0.000000 | 165.000000 | 72.000000 | 120.000000 | 80.000000 | 93.333333 | 40.000000 | 1.000000 | 1.000000 | 26.346494 | 1.000000 | 0.000000 |
| 75% | 58.000000 | 1.000000 | 170.000000 | 82.000000 | 140.000000 | 90.000000 | 103.333333 | 50.000000 | 1.000000 | 1.000000 | 30.094730 | 1.000000 | 1.000000 |
| max | 65.000000 | 1.000000 | 187.000000 | 116.000000 | 309.000000 | 182.000000 | 186.666667 | 227.727435 | 3.000000 | 3.000000 | 55.459105 | 1.000000 | 1.000000 |
# To make the diabetes dataset fit our cardio dataset well, we drop the records aged below 39 in the diabetes dataset
data.drop(data[data['year']<39].index, inplace=True)
data.describe()
| | Pregnancies | Glucose | map | SkinThickness | Insulin | bmi | DiabetesPedigreeFunction | year | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 219.000000 | 219.000000 | 219.000000 | 219.000000 | 219.000000 | 219.000000 | 219.000000 | 219.000000 | 219.000000 |
| mean | 6.657534 | 131.456621 | 77.650685 | 30.666667 | 155.511416 | 33.013699 | 0.474548 | 49.045662 | 0.506849 |
| std | 3.537035 | 32.280333 | 10.932241 | 8.324460 | 92.092189 | 6.153413 | 0.323434 | 8.579243 | 0.501098 |
| min | 0.000000 | 57.000000 | 50.000000 | 7.000000 | 22.000000 | 19.600000 | 0.085000 | 39.000000 | 0.000000 |
| 25% | 5.000000 | 106.000000 | 71.000000 | 27.000000 | 102.500000 | 28.750000 | 0.241000 | 42.000000 | 0.000000 |
| 50% | 7.000000 | 129.000000 | 76.000000 | 30.000000 | 145.000000 | 32.900000 | 0.376000 | 46.000000 | 1.000000 |
| 75% | 9.000000 | 154.000000 | 84.000000 | 32.000000 | 169.500000 | 37.100000 | 0.643500 | 54.000000 | 1.000000 |
| max | 17.000000 | 197.000000 | 114.000000 | 99.000000 | 846.000000 | 52.300000 | 1.781000 | 81.000000 | 1.000000 |
Now, the mean values of year in the two datasets are 49 and 53, which are close. Besides, the range of year in our diabetes dataset covers that of our cardio dataset.
#check
data.head(2)
| | Pregnancies | Glucose | map | SkinThickness | Insulin | bmi | DiabetesPedigreeFunction | year | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148.0 | 72.0 | 35.0 | 169.5 | 33.6 | 0.627 | 50 | 1 |
| 8 | 2 | 197.0 | 70.0 | 45.0 | 543.0 | 30.5 | 0.158 | 53 | 1 |
data['Outcome'].value_counts()
1    111
0    108
Name: Outcome, dtype: int64
This shows that the target variable (Outcome) in the diabetes dataset is balanced.
#check
data_for_merge= pd.DataFrame(df[['map','bmi','year']])
data_for_merge.head(1)
| | map | bmi | year |
|---|---|---|---|
| 0 | 90.0 | 21.96712 | 50 |
# Now it is time to train the model; here we use a random forest classifier
# Extract Response and Predictors
y = pd.DataFrame(data['Outcome'])
X = pd.DataFrame(data[['map','bmi','year']])
# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=99)
# Adjust the max_depth parameter to enhance the accuracy
score1 = []
for i in range(0, 30):
    rfc1 = RandomForestClassifier(max_depth=i+1, n_jobs=-1, random_state=90)
    rfc1.fit(X_train, y_train.values.ravel())  # ravel() avoids the column-vector DataConversionWarning
    pred = rfc1.predict(X_test)
    score = 100*np.round(accuracy_score(y_test, pred), decimals=4)
    score1.append(score)
print("The best accuracy is", max(score1))
print("Occur at max_depth =", score1.index(max(score1))+1)
The best accuracy is 65.14999999999999
Occur at max_depth = 5
RFC = RandomForestClassifier(n_jobs=-1, random_state=90, max_depth=5)
RFC.fit(X_train, y_train.values.ravel())  # ravel() avoids the column-vector DataConversionWarning
pred = RFC.predict(X_test)
score = 100*np.round(accuracy_score(y_test,pred),decimals=4)
print("The accuracy is ", score,"%")
The accuracy is 65.14999999999999 %
pred_merge = RFC.predict(data_for_merge)
pred_merge = pd.DataFrame(pred_merge, columns = ["Diabetes"], index = df.index)
df.insert(12,'diabetes',pred_merge)
df.head(10)
| | year | gender | height | weight | ap_hi | ap_lo | map | pp | cholesterol | gluc | bmi | active | diabetes | cardio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50 | 1 | 168 | 62.0 | 110.0 | 80.0 | 90.000000 | 30.0 | 1 | 1 | 21.967120 | 1 | 0 | 0 |
| 1 | 55 | 0 | 156 | 85.0 | 140.0 | 90.0 | 106.666667 | 50.0 | 3 | 1 | 34.927679 | 1 | 1 | 1 |
| 2 | 52 | 0 | 165 | 64.0 | 130.0 | 70.0 | 90.000000 | 60.0 | 3 | 1 | 23.507805 | 0 | 0 | 1 |
| 3 | 48 | 1 | 169 | 82.0 | 150.0 | 100.0 | 116.666667 | 50.0 | 1 | 1 | 28.710479 | 1 | 1 | 1 |
| 4 | 48 | 0 | 156 | 56.0 | 100.0 | 60.0 | 73.333333 | 40.0 | 1 | 1 | 23.011177 | 0 | 0 | 0 |
| 5 | 60 | 0 | 151 | 67.0 | 120.0 | 80.0 | 93.333333 | 40.0 | 2 | 2 | 29.384676 | 0 | 0 | 0 |
| 6 | 61 | 0 | 157 | 93.0 | 130.0 | 80.0 | 96.666667 | 50.0 | 3 | 1 | 37.729725 | 1 | 1 | 0 |
| 7 | 62 | 1 | 178 | 95.0 | 130.0 | 90.0 | 103.333333 | 40.0 | 3 | 3 | 29.983588 | 1 | 1 | 1 |
| 8 | 48 | 0 | 158 | 71.0 | 110.0 | 70.0 | 83.333333 | 40.0 | 1 | 1 | 28.440955 | 1 | 1 | 0 |
| 9 | 54 | 0 | 164 | 68.0 | 110.0 | 60.0 | 76.666667 | 50.0 | 1 | 1 | 25.282570 | 0 | 1 | 0 |
For model training and testing, we initialised and tuned 7 different models and applied a model-ensembling technique to combine their decisions and improve overall performance.
Models tested out:
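One common way to combine several classifiers' decisions is majority voting; a minimal sketch with scikit-learn's `VotingClassifier` on synthetic data (the estimators and settings here are illustrative, not the team's tuned models):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=13, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Majority ('hard') vote over three base classifiers
ensemble = VotingClassifier(estimators=[
    ('dt', DecisionTreeClassifier(max_depth=6, random_state=42)),
    ('rf', RandomForestClassifier(max_depth=11, random_state=90)),
    ('lr', LogisticRegression(max_iter=1000)),
], voting='hard')
ensemble.fit(X_tr, y_tr)
print('Ensemble test accuracy:', ensemble.score(X_te, y_te))
```

`voting='soft'` would average predicted probabilities instead of counting votes, which often helps when the base models are well calibrated.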
df
| | year | gender | height | weight | ap_hi | ap_lo | map | pp | cholesterol | gluc | bmi | active | diabetes | cardio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50 | 1 | 168 | 62.0 | 110.0 | 80.0 | 90.000000 | 30.0 | 1 | 1 | 21.967120 | 1 | 0 | 0 |
| 1 | 55 | 0 | 156 | 85.0 | 140.0 | 90.0 | 106.666667 | 50.0 | 3 | 1 | 34.927679 | 1 | 1 | 1 |
| 2 | 52 | 0 | 165 | 64.0 | 130.0 | 70.0 | 90.000000 | 60.0 | 3 | 1 | 23.507805 | 0 | 0 | 1 |
| 3 | 48 | 1 | 169 | 82.0 | 150.0 | 100.0 | 116.666667 | 50.0 | 1 | 1 | 28.710479 | 1 | 1 | 1 |
| 4 | 48 | 0 | 156 | 56.0 | 100.0 | 60.0 | 73.333333 | 40.0 | 1 | 1 | 23.011177 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 69994 | 58 | 0 | 165 | 80.0 | 150.0 | 80.0 | 103.333333 | 70.0 | 1 | 1 | 29.384757 | 1 | 1 | 1 |
| 69995 | 53 | 1 | 168 | 76.0 | 120.0 | 80.0 | 93.333333 | 40.0 | 1 | 1 | 26.927438 | 1 | 1 | 0 |
| 69997 | 52 | 1 | 183 | 105.0 | 180.0 | 90.0 | 120.000000 | 90.0 | 3 | 1 | 31.353579 | 0 | 1 | 1 |
| 69998 | 61 | 0 | 163 | 72.0 | 135.0 | 80.0 | 98.333333 | 55.0 | 1 | 2 | 27.099251 | 0 | 1 | 1 |
| 69999 | 56 | 0 | 170 | 72.0 | 120.0 | 80.0 | 93.333333 | 40.0 | 2 | 1 | 24.913495 | 1 | 1 | 0 |
68336 rows × 14 columns
# Only include features that contribute to model improvement
df_train = df[['year', 'gender', 'height', 'weight', 'ap_hi', 'ap_lo', 'map', 'pp', 'cholesterol', 'gluc', 'bmi', 'active', 'diabetes', 'cardio']]
df_train
| | year | gender | height | weight | ap_hi | ap_lo | map | pp | cholesterol | gluc | bmi | active | diabetes | cardio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50 | 1 | 168 | 62.0 | 110.0 | 80.0 | 90.000000 | 30.0 | 1 | 1 | 21.967120 | 1 | 0 | 0 |
| 1 | 55 | 0 | 156 | 85.0 | 140.0 | 90.0 | 106.666667 | 50.0 | 3 | 1 | 34.927679 | 1 | 1 | 1 |
| 2 | 52 | 0 | 165 | 64.0 | 130.0 | 70.0 | 90.000000 | 60.0 | 3 | 1 | 23.507805 | 0 | 0 | 1 |
| 3 | 48 | 1 | 169 | 82.0 | 150.0 | 100.0 | 116.666667 | 50.0 | 1 | 1 | 28.710479 | 1 | 1 | 1 |
| 4 | 48 | 0 | 156 | 56.0 | 100.0 | 60.0 | 73.333333 | 40.0 | 1 | 1 | 23.011177 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 69994 | 58 | 0 | 165 | 80.0 | 150.0 | 80.0 | 103.333333 | 70.0 | 1 | 1 | 29.384757 | 1 | 1 | 1 |
| 69995 | 53 | 1 | 168 | 76.0 | 120.0 | 80.0 | 93.333333 | 40.0 | 1 | 1 | 26.927438 | 1 | 1 | 0 |
| 69997 | 52 | 1 | 183 | 105.0 | 180.0 | 90.0 | 120.000000 | 90.0 | 3 | 1 | 31.353579 | 0 | 1 | 1 |
| 69998 | 61 | 0 | 163 | 72.0 | 135.0 | 80.0 | 98.333333 | 55.0 | 1 | 2 | 27.099251 | 0 | 1 | 1 |
| 69999 | 56 | 0 | 170 | 72.0 | 120.0 | 80.0 | 93.333333 | 40.0 | 2 | 1 | 24.913495 | 1 | 1 | 0 |
68336 rows × 14 columns
df_train.columns
Index(['year', 'gender', 'height', 'weight', 'ap_hi', 'ap_lo', 'map', 'pp',
'cholesterol', 'gluc', 'bmi', 'active', 'diabetes', 'cardio'],
dtype='object')
# Splitting data into train set and test set
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X_train, X_test, y_train, y_test = train_test_split(df_train.drop('cardio',axis=1),df_train.cardio,test_size=0.20, random_state=42)
# Scaling data:
# In many machine learning algorithms, we scale the features to bring them to the same standing,
# so that no single feature dominates the model just because of its large magnitude.
to_be_scaled_feat = ['year', 'height', 'weight', 'ap_hi', 'ap_lo', 'map', 'pp', 'bmi']
scaler=StandardScaler()
scaler.fit(X_train[to_be_scaled_feat])
# saving the scaler
import pickle
filename = 'scaler.sav'
pickle.dump(scaler, open(filename, 'wb'))
X_train[to_be_scaled_feat] = scaler.transform(X_train[to_be_scaled_feat])
X_test[to_be_scaled_feat] = scaler.transform(X_test[to_be_scaled_feat])
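As a sanity check on the scaling step, `StandardScaler` applies z = (x − μ)/σ with μ and σ estimated from the training split only; a minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0], [4.0]])
test = np.array([[2.5]])

scaler = StandardScaler().fit(train)   # mu and sigma come from train only
z = scaler.transform(test)

mu, sigma = train.mean(), train.std()  # population std, matching sklearn
assert np.isclose(z[0, 0], (2.5 - mu) / sigma)
print(z)  # 2.5 is exactly the train mean, so z is 0
```

Fitting the scaler on the full data before splitting would leak test-set statistics into training, which is why `fit` is called on `X_train` alone above.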
# Import DecisionTreeClassifier model from Scikit-Learn
from sklearn.tree import DecisionTreeClassifier
# Decision Tree using Train Data
dectree = DecisionTreeClassifier(max_depth = 2) # create the decision tree object
dectree.fit(X_train, y_train) # train the decision tree model
# Predict the response on the train and test sets
y_train_pred = dectree.predict(X_train)
y_test_pred = dectree.predict(X_test)
# Check the Goodness of Fit (on Train Data)
print("Goodness of Fit of Model \tTrain Dataset")
print("Classification Accuracy \t:", dectree.score(X_train, y_train))
print()
# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("Classification Accuracy \t:", dectree.score(X_test, y_test))
print()
Goodness of Fit of Model 	Train Dataset
Classification Accuracy 	: 0.7132143118460526

Goodness of Fit of Model 	Test Dataset
Classification Accuracy 	: 0.7147351477904594
This is not good enough; we shall improve the decision tree classifier by tuning max_depth.
train_results = []
test_results = []
for x in range(1, 21):
    # Decision Tree using Train Data
    dectree = DecisionTreeClassifier(max_depth=x)  # create the decision tree object
    dectree.fit(X_train, y_train)  # train the decision tree model
    # Predict Response corresponding to Predictors
    y_train_pred = dectree.predict(X_train)
    y_test_pred = dectree.predict(X_test)
    train_results.append(dectree.score(X_train, y_train))
    test_results.append(dectree.score(X_test, y_test))
best_max_depth = np.argmax(test_results) + 1
print('Best parameter is max_depth =', best_max_depth, 'with test classification accuracy =', np.max(test_results))
fig, axes = plt.subplots(1, 1, figsize=(12, 8))
sns.lineplot(x=[x for x in range(1, 21)], y=train_results, label='train')
sns.lineplot(x=[x for x in range(1, 21)], y=test_results, label='test')
plt.legend()
plt.xlabel('max_depth')
plt.ylabel('accuracy')
plt.show()
Best parameter is max_depth = 6 with test classification accuracy = 0.7297336845185836
# Getting our final improved model
dectree = DecisionTreeClassifier(max_depth = best_max_depth) # create the decision tree object
dectree.fit(X_train, y_train) # train the decision tree model
# Predict the response on the train and test sets
y_train_pred = dectree.predict(X_train)
y_test_pred = dectree.predict(X_test)
# Plot the Confusion Matrix for Train and Test
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
print(classification_report(y_test, y_test_pred, digits=4))
f, axes = plt.subplots(1, 1, figsize=(8, 4))
axes.set_title('Test Set')
sns.heatmap(confusion_matrix(y_test, y_test_pred),
            annot=True, fmt=".0f", annot_kws={"size": 18})
precision recall f1-score support
0 0.7087 0.7866 0.7457 6880
1 0.7566 0.6724 0.7120 6788
accuracy 0.7299 13668
macro avg 0.7327 0.7295 0.7288 13668
weighted avg 0.7325 0.7299 0.7289 13668
<AxesSubplot:title={'center':'Test Set'}>
# Plot the trained Decision Tree
from sklearn.tree import plot_tree
fig, ax = plt.subplots(figsize=(20, 20))
out = plot_tree(dectree,
                feature_names=X_train.columns,
                class_names=[str(x) for x in dectree.classes_],
                filled=True)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor('black')
        arrow.set_linewidth(3)
plt.show()
dectree.feature_importances_
array([1.21849517e-01, 0.00000000e+00, 1.23098803e-03, 6.15060849e-03,
7.68999886e-01, 5.06503902e-04, 1.39559120e-03, 4.90818698e-03,
7.16704194e-02, 5.99808881e-03, 1.53754144e-02, 1.91479578e-03,
0.00000000e+00])
# Plotting feature importances
importance_dectree = pd.DataFrame([*zip(X_train.columns, dectree.feature_importances_)], columns=['feature_name', 'importance'])
plot_order = importance_dectree['importance'].sort_values(ascending=False).index.values
ls = []
for x in plot_order:
    ls.append(importance_dectree['feature_name'][x])
fig, ax = plt.subplots(figsize=(20, 20))
sns.barplot(x="feature_name", y="importance", data=importance_dectree, order=ls)
<AxesSubplot:xlabel='feature_name', ylabel='importance'>
This model fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
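The averaging can be verified directly: a random forest's predicted probabilities are the mean of its individual trees' probabilities. A small sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=13, random_state=42)

rf = RandomForestClassifier(n_estimators=25, random_state=90).fit(X, y)

# Forest probability = average of the per-tree probabilities
per_tree_mean = np.mean([t.predict_proba(X) for t in rf.estimators_], axis=0)
assert np.allclose(per_tree_mean, rf.predict_proba(X))
print("Forest output equals the average of its trees' outputs")
```

Each tree is trained on a bootstrap sub-sample with random feature subsets, so the trees disagree; averaging their votes reduces variance without increasing bias much.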
# tuning n_estimators for best performance
train_results = []
test_results = []
for x in range(1, 101):
    # Random Forest using Train Data
    rfc = RandomForestClassifier(n_estimators=x, n_jobs=-1, random_state=90)  # create the random forest object
    rfc.fit(X_train, y_train)  # train the random forest model
    # Predict Response corresponding to Predictors
    y_train_pred = rfc.predict(X_train)
    y_test_pred = rfc.predict(X_test)
    train_results.append(rfc.score(X_train, y_train))
    test_results.append(rfc.score(X_test, y_test))
best_n_est = np.argmax(test_results) + 1
print('Best parameter is n_estimators =', best_n_est, 'with test classification accuracy =', np.max(test_results))
fig, axes = plt.subplots(1, 1, figsize=(12, 8))
sns.lineplot(x=[x for x in range(1, 101)], y=train_results, label='train')
sns.lineplot(x=[x for x in range(1, 101)], y=test_results, label='test')
plt.legend()
plt.xlabel('n_estimators')
plt.ylabel('accuracy')
plt.show()
Best parameter is n_estimators = 99 with test classification accuracy = 0.7032484635645303
# tuning max_depth for best performance
train_results = []
test_results = []
for x in range(1, 21):
    # Random Forest using Train Data
    rfc = RandomForestClassifier(max_depth=x, n_jobs=-1, random_state=90)  # create the random forest object
    rfc.fit(X_train, y_train)  # train the random forest model
    # Predict Response corresponding to Predictors
    y_train_pred = rfc.predict(X_train)
    y_test_pred = rfc.predict(X_test)
    train_results.append(rfc.score(X_train, y_train))
    test_results.append(rfc.score(X_test, y_test))
best_depth = np.argmax(test_results) + 1
print('Best parameter is max_depth =', best_depth, 'with test classification accuracy =', np.max(test_results))
fig, axes = plt.subplots(1, 1, figsize=(12, 8))
sns.lineplot(x=[x for x in range(1, 21)], y=train_results, label='train')
sns.lineplot(x=[x for x in range(1, 21)], y=test_results, label='test')
plt.legend()
plt.xlabel('max_depth')
plt.ylabel('accuracy')
plt.show()
Best parameter is max_depth = 11 with test classification accuracy = 0.7342698273339187
# tuning min_samples_split for best performance
train_results = []
test_results = []
for x in range(40, 71):
    # Random Forest using Train Data
    rfc = RandomForestClassifier(min_samples_split=x, n_jobs=-1, random_state=90)  # create the random forest object
    rfc.fit(X_train, y_train)  # train the random forest model
    # Predict Response corresponding to Predictors
    y_train_pred = rfc.predict(X_train)
    y_test_pred = rfc.predict(X_test)
    train_results.append(rfc.score(X_train, y_train))
    test_results.append(rfc.score(X_test, y_test))
best_mss = np.argmax(test_results) + 40
print('Best parameter is min_samples_split =', best_mss, 'with test classification accuracy =', np.max(test_results))
fig, axes = plt.subplots(1, 1, figsize=(12, 8))
sns.lineplot(x=[x for x in range(40, 71)], y=train_results, label='train')
sns.lineplot(x=[x for x in range(40, 71)], y=test_results, label='test')
plt.legend()
plt.xlabel('min_samples_split')
plt.ylabel('accuracy')
plt.show()
Best parameter is min_samples_split = 42 with test classification accuracy = 0.7333187006145742
We realised that tuning max_depth gave the best test classification accuracy.
# Getting best performing model
rfc=RandomForestClassifier(max_depth=best_depth, n_jobs=-1,random_state=90)
rfc.fit(X_train,y_train)
pred = rfc.predict(X_test)
# Plot the Confusion Matrix for Train and Test
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
print(classification_report(y_test, pred, digits=4))
f, axes = plt.subplots(1, 1, figsize=(8, 4))
axes.set_title('Test Set')
sns.heatmap(confusion_matrix(y_test, pred),
            annot=True, fmt=".0f", annot_kws={"size": 18})
precision recall f1-score support
0 0.7159 0.7827 0.7478 6880
1 0.7568 0.6852 0.7192 6788
accuracy 0.7343 13668
macro avg 0.7363 0.7339 0.7335 13668
weighted avg 0.7362 0.7343 0.7336 13668
<AxesSubplot:title={'center':'Test Set'}>
rfc.feature_importances_
array([0.10933635, 0.00596307, 0.0319506 , 0.04093034, 0.27043486,
0.07722145, 0.19826993, 0.10985191, 0.07239483, 0.01258894,
0.05610712, 0.00909823, 0.00585238])
X_train.columns
Index(['year', 'gender', 'height', 'weight', 'ap_hi', 'ap_lo', 'map', 'pp',
'cholesterol', 'gluc', 'bmi', 'active', 'diabetes'],
dtype='object')
# Plot feature importances
importance_randomforest = pd.DataFrame([*zip(X_train.columns, rfc.feature_importances_)], columns=['feature_name', 'importance'])
plot_order = importance_randomforest['importance'].sort_values(ascending=False).index.values
ls = []
for x in plot_order:
    ls.append(importance_randomforest['feature_name'][x])
fig, ax = plt.subplots(figsize=(20, 20))
sns.barplot(x="feature_name", y="importance", data=importance_randomforest, order=ls)
<AxesSubplot:xlabel='feature_name', ylabel='importance'>
XGBoost stands for eXtreme Gradient Boosting. It is a decision-tree-based ensemble Machine Learning algorithm that uses a gradient boosting framework. Gradient boosting is an approach where new models are created that predict the residuals or errors of prior models and then added together to make the final prediction. It is called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models.
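The residual-fitting idea can be shown with two manual stages on a toy regression problem (a sketch of plain gradient boosting with squared loss, not XGBoost itself):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# Stage 1: fit a weak learner to y
tree1 = DecisionTreeRegressor(max_depth=2).fit(X, y)
residual = y - tree1.predict(X)

# Stage 2: fit the next learner to the residuals and add its predictions
tree2 = DecisionTreeRegressor(max_depth=2).fit(X, residual)
boosted = tree1.predict(X) + tree2.predict(X)

mse1 = np.mean((y - tree1.predict(X)) ** 2)
mse2 = np.mean((y - boosted) ** 2)
print(mse1, mse2)  # the second stage reduces the training error
assert mse2 < mse1
```

XGBoost applies the same principle with many stages, a learning rate that shrinks each stage's contribution, and regularisation terms on the trees.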
import warnings
warnings.filterwarnings("ignore", category=UserWarning)
def hyperopt_xgb_score(params):
    clf = XGBClassifier(**params)
    clf.fit(X_train, y_train)
    xgb_pred = clf.predict(X_test)
    current_score = accuracy_score(y_test, xgb_pred)
    print(current_score, params)
    loss = 1 - current_score  # we set the loss as 1 - score to enable fmin to minimise it
    return loss
space_xgb = {
    'learning_rate': hp.quniform('learning_rate', 0, 0.05, 0.0001),
    'n_estimators': hp.choice('n_estimators', range(1, 1000)),
    'eta': hp.quniform('eta', 0.025, 0.5, 0.005),
    'max_depth': hp.choice('max_depth', np.arange(2, 12, dtype=int)),
    'min_child_weight': hp.quniform('min_child_weight', 1, 9, 0.025),
    'subsample': hp.quniform('subsample', 0.5, 1, 0.005),
    'gamma': hp.quniform('gamma', 0.5, 1, 0.005),
    'colsample_bytree': hp.quniform('colsample_bytree', 0.5, 1, 0.005),
    'eval_metric': 'auc',
    'objective': 'binary:logistic',
    'booster': 'gbtree',
    'tree_method': 'exact',
    'missing': None
}
best = fmin(fn=hyperopt_xgb_score, space=space_xgb, algo=tpe.suggest, max_evals=20)
0.7334650278021656
{'booster': 'gbtree', 'colsample_bytree': 0.645, 'eta': 0.425, 'eval_metric': 'auc', 'gamma': 0.925, 'learning_rate': 0.0366, 'max_depth': 5, 'min_child_weight': 8.85, 'missing': None, 'n_estimators': 467, 'objective': 'binary:logistic', 'subsample': 0.625, 'tree_method': 'exact'}
0.7302458296751536
{'booster': 'gbtree', 'colsample_bytree': 0.625, 'eta': 0.31, 'eval_metric': 'auc', 'gamma': 0.68, 'learning_rate': 0.0036000000000000003, 'max_depth': 3, 'min_child_weight': 3.9250000000000003, 'missing': None, 'n_estimators': 800, 'objective': 'binary:logistic', 'subsample': 0.92, 'tree_method': 'exact'}
0.7232221246707639
{'booster': 'gbtree', 'colsample_bytree': 0.865, 'eta': 0.425, 'eval_metric': 'auc', 'gamma': 0.8200000000000001, 'learning_rate': 0.0296, 'max_depth': 2, 'min_child_weight': 6.625, 'missing': None, 'n_estimators': 60, 'objective': 'binary:logistic', 'subsample': 0.96, 'tree_method': 'exact'}
0.7249780509218613
{'booster': 'gbtree', 'colsample_bytree': 0.8200000000000001, 'eta': 0.14, 'eval_metric': 'auc', 'gamma': 0.98, 'learning_rate': 0.041600000000000005, 'max_depth': 8, 'min_child_weight': 7.800000000000001, 'missing': None, 'n_estimators': 997, 'objective': 'binary:logistic', 'subsample': 0.7000000000000001, 'tree_method': 'exact'}
0.7329528826455955
{'booster': 'gbtree', 'colsample_bytree': 0.64, 'eta': 0.145, 'eval_metric': 'auc', 'gamma': 0.8250000000000001, 'learning_rate': 0.0429, 'max_depth': 10, 'min_child_weight': 8.875, 'missing': None, 'n_estimators': 159, 'objective': 'binary:logistic', 'subsample': 0.5650000000000001, 'tree_method': 'exact'}
0.7311237928007024
{'booster': 'gbtree', 'colsample_bytree': 0.705, 'eta': 0.33, 'eval_metric': 'auc', 'gamma': 0.5750000000000001, 'learning_rate': 0.0145, 'max_depth': 2, 'min_child_weight': 4.9750000000000005, 'missing': None, 'n_estimators': 666, 'objective': 'binary:logistic', 'subsample': 0.9400000000000001, 'tree_method': 'exact'}
0.7303189932689493
{'booster': 'gbtree', 'colsample_bytree': 0.655, 'eta': 0.28, 'eval_metric': 'auc', 'gamma': 0.685, 'learning_rate': 0.0045000000000000005, 'max_depth': 4, 'min_child_weight': 8.3, 'missing': None, 'n_estimators': 313, 'objective': 'binary:logistic', 'subsample': 0.765, 'tree_method': 'exact'}
0.7257828504536142
{'booster': 'gbtree', 'colsample_bytree': 0.74, 'eta': 0.075, 'eval_metric': 'auc', 'gamma': 0.885, 'learning_rate': 0.046200000000000005, 'max_depth': 10, 'min_child_weight': 3.475, 'missing': None, 'n_estimators': 473, 'objective': 'binary:logistic', 'subsample': 0.715, 'tree_method': 'exact'}
0.727758267486099
{'booster': 'gbtree', 'colsample_bytree': 0.935, 'eta': 0.075, 'eval_metric': 'auc', 'gamma': 0.85, 'learning_rate': 0.0322, 'max_depth': 10, 'min_child_weight': 5.3500000000000005, 'missing': None, 'n_estimators': 604, 'objective': 'binary:logistic', 'subsample': 0.8, 'tree_method': 'exact'}
0.733611354989757
{'booster': 'gbtree', 'colsample_bytree': 0.595, 'eta': 0.325, 'eval_metric': 'auc', 'gamma': 0.985, 'learning_rate': 0.01, 'max_depth': 8, 'min_child_weight': 3.5500000000000003, 'missing': None, 'n_estimators': 309, 'objective': 'binary:logistic', 'subsample': 0.685, 'tree_method': 'exact'}
0.7329528826455955
{'booster': 'gbtree', 'colsample_bytree': 0.9400000000000001, 'eta': 0.425, 'eval_metric': 'auc', 'gamma': 0.9500000000000001, 'learning_rate': 0.0347, 'max_depth': 6, 'min_child_weight': 1.225, 'missing': None, 'n_estimators': 524, 'objective': 'binary:logistic', 'subsample': 0.6, 'tree_method': 'exact'}
0.7339771729587358
{'booster': 'gbtree', 'colsample_bytree': 0.93, 'eta': 0.255, 'eval_metric': 'auc', 'gamma': 0.515, 'learning_rate': 0.0005, 'max_depth': 6, 'min_child_weight': 7.025, 'missing': None, 'n_estimators': 623, 'objective': 'binary:logistic', 'subsample': 0.545, 'tree_method': 'exact'}
0.7311969563944981
{'booster': 'gbtree', 'colsample_bytree': 0.965, 'eta': 0.225, 'eval_metric': 'auc', 'gamma': 0.55, 'learning_rate': 0.0207, 'max_depth': 6, 'min_child_weight': 1.6500000000000001, 'missing': None, 'n_estimators': 272, 'objective': 'binary:logistic', 'subsample': 0.72, 'tree_method': 'exact'}
0.7324407374890255
{'booster': 'gbtree', 'colsample_bytree': 0.765, 'eta': 0.295, 'eval_metric': 'auc', 'gamma': 0.99, 'learning_rate': 0.041, 'max_depth': 3, 'min_child_weight': 3.1, 'missing': None, 'n_estimators': 582, 'objective': 'binary:logistic', 'subsample': 0.995, 'tree_method': 'exact'}
0.7325139010828212
{'booster': 'gbtree', 'colsample_bytree': 0.73, 'eta': 0.035, 'eval_metric': 'auc', 'gamma': 0.74, 'learning_rate': 0.0339, 'max_depth': 5, 'min_child_weight': 4.375, 'missing': None, 'n_estimators': 175, 'objective': 'binary:logistic', 'subsample': 0.765, 'tree_method': 'exact'}
0.733099209833187
{'booster': 'gbtree', 'colsample_bytree': 0.96, 'eta': 0.15, 'eval_metric': 'auc', 'gamma': 0.91, 'learning_rate': 0.0475, 'max_depth': 4, 'min_child_weight': 8.950000000000001, 'missing': None, 'n_estimators': 152, 'objective': 'binary:logistic', 'subsample': 0.545, 'tree_method': 'exact'}
0.7292947029558092
{'booster': 'gbtree', 'colsample_bytree': 0.615, 'eta': 0.425, 'eval_metric': 'auc', 'gamma': 0.595, 'learning_rate': 0.0009000000000000001, 'max_depth': 5, 'min_child_weight': 1.5250000000000001, 'missing': None, 'n_estimators': 433, 'objective': 'binary:logistic', 'subsample': 0.725, 'tree_method': 'exact'}
0.7301726660813579
{'booster': 'gbtree', 'colsample_bytree': 0.53, 'eta': 0.295, 'eval_metric': 'auc', 'gamma': 0.6, 'learning_rate': 0.0085, 'max_depth': 9, 'min_child_weight': 3.0, 'missing': None, 'n_estimators': 149, 'objective': 'binary:logistic', 'subsample': 0.935, 'tree_method': 'exact'}
0.7280509218612818
{'booster': 'gbtree', 'colsample_bytree': 0.9, 'eta': 0.06, 'eval_metric': 'auc', 'gamma': 0.61, 'learning_rate': 0.0495, 'max_depth': 7, 'min_child_weight': 2.5, 'missing': None, 'n_estimators': 672, 'objective': 'binary:logistic', 'subsample': 0.725, 'tree_method': 'exact'}
0.72585601404741
{'booster': 'gbtree', 'colsample_bytree': 0.88, 'eta': 0.195, 'eval_metric': 'auc', 'gamma': 0.55, 'learning_rate': 0.0021000000000000003, 'max_depth': 3, 'min_child_weight': 5.95, 'missing': None, 'n_estimators': 170, 'objective': 'binary:logistic', 'subsample': 0.86, 'tree_method': 'exact'}
100%|███████████████████████████████████████████████| 20/20 [01:38<00:00, 4.94s/trial, best loss: 0.26602282704126423]
print('The best parameters of XgBoost:')
print(best)
The best parameters of XgBoost:
{'colsample_bytree': 0.93, 'eta': 0.255, 'gamma': 0.515, 'learning_rate': 0.0005, 'max_depth': 4, 'min_child_weight': 7.025, 'n_estimators': 622, 'subsample': 0.545}
params = space_eval(space_xgb, best)
params
{'booster': 'gbtree',
'colsample_bytree': 0.93,
'eta': 0.255,
'eval_metric': 'auc',
'gamma': 0.515,
'learning_rate': 0.0005,
'max_depth': 6,
'min_child_weight': 7.025,
'missing': None,
'n_estimators': 623,
'objective': 'binary:logistic',
'subsample': 0.545,
'tree_method': 'exact'}
After a few trials, we obtained our best set of parameters. Since hyperopt samples stochastically from our parameter space, the best parameters will differ between runs; to make our results reproducible, we hard-code the best parameters we found below.
best_params = {'booster': 'gbtree',
'colsample_bytree': 0.74,
'eta': 0.25,
'eval_metric': 'auc',
'gamma': 0.665,
'learning_rate': 0.0016,
'max_depth': 11,
'min_child_weight': 5.5,
'missing': None,
'n_estimators': 673,
'objective': 'binary:logistic',
'subsample': 0.58,
'tree_method': 'exact'}
import warnings
warnings.filterwarnings("ignore", category=UserWarning)
XGB_Classifier = XGBClassifier(**best_params)
XGB_Classifier.fit(X_train, y_train)
acc_XGB_Classifier = round(XGB_Classifier.score(X_train, y_train) * 100, 2)
# Check train accuracy
acc_XGB_Classifier
75.42
# Make prediction on test set
pred = XGB_Classifier.predict(X_test)
# Plot the Confusion Matrix for Train and Test
print(classification_report(y_test, pred, digits=4))
f, axes = plt.subplots(1, 1, figsize=(8, 4))
axes.set_title('Test Set')
sns.heatmap(confusion_matrix(y_test, pred),
annot = True, fmt=".0f", annot_kws={"size": 18})
precision recall f1-score support
0 0.7175 0.7810 0.7479 6880
1 0.7561 0.6884 0.7207 6788
accuracy 0.7350 13668
macro avg 0.7368 0.7347 0.7343 13668
weighted avg 0.7367 0.7350 0.7344 13668
<AxesSubplot:title={'center':'Test Set'}>
# Plot feature importances
fig = plt.figure(figsize = (15,15))
axes = fig.add_subplot(111)
xgb.plot_importance(XGB_Classifier,ax = axes,height =0.5)
plt.show();
plt.close()
We can see that XGBoost gives the best result of the 3 decision-tree-based models. This is expected, as XGBoost uses gradient boosting to minimise loss: the model makes an initial prediction, analyses its mistakes, and gives more weight to the data points it misclassified in the next iteration. This iterative error-correction is more systematic than the random sampling used to build the decision trees in a random forest, and hence leads to better results. We will include only XGBoost as the representative of the tree-based models in our final ensemble model.
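The iterative error-correction idea can be sketched with a toy example (a simplified illustration of the boosting principle, not XGBoost's actual gradient/Hessian-based tree algorithm):

```python
import numpy as np

# Toy illustration of boosting: each round, a "weak learner" (here just a
# scaled step towards the residuals) corrects what the ensemble still gets
# wrong, so poorly-predicted points receive the most attention.
y = np.array([3.0, 5.0, 8.0, 10.0])
pred = np.full_like(y, y.mean())   # round 0: predict the mean
for _ in range(50):
    residual = y - pred            # current mistakes of the ensemble
    pred += 0.1 * residual         # 0.1 plays the role of the learning rate
print(np.round(pred, 3))           # predictions approach y as rounds accumulate
```

In real gradient boosting the step towards the residuals is a fitted decision tree rather than a constant multiple, but the additive, error-driven structure is the same.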
# Saving xgb model
import pickle

filename = 'xgb.sav'
pickle.dump(XGB_Classifier, open(filename, 'wb'))
The K-Nearest Neighbours algorithm captures the similarity between the data point to be classified and its k nearest neighbours, found by computing the distance between them. Among the k nearest neighbours, a voting mechanism is applied and the majority class is assigned to the data point.

Image showing data point (blue star) being classified as 'red circle' instead of 'green square'
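The voting mechanism can be written out in a few lines (an illustrative sketch with made-up 2-D points, not the sklearn implementation we use below):

```python
import numpy as np
from collections import Counter

def knn_predict(X_tr, y_tr, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_tr - x, axis=1)   # Euclidean distance to each point
    nearest = np.argsort(dists)[:k]            # indices of the k closest
    return Counter(y_tr[nearest]).most_common(1)[0][0]

X_toy = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_toy, y_toy, np.array([4.5, 5.0])))  # nearest cluster is class 1
```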
# implementing elbow method to find best k
error_rate = []
for i in range(1,60):
knn = KNeighborsClassifier(n_neighbors=i)
knn.fit(X_train,y_train)
pred_i = knn.predict(X_test)
error_rate.append(np.mean(pred_i != y_test))
plt.figure(figsize=(10,6))
plt.plot(range(1,60),error_rate,color='blue', linestyle='dashed', marker='o',markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
print("Minimum error:",min(error_rate),"at K =",error_rate.index(min(error_rate))+1) # +1 since index 0 corresponds to K = 1
Minimum error: 0.27158326016973955 at K = 44
knn = KNeighborsClassifier(n_neighbors=44)
knn.fit(X_train,y_train)
knn_pred = knn.predict(X_test)
# Plot the Confusion Matrix for Train and Test
print(classification_report(y_test, knn_pred, digits=4))
f, axes = plt.subplots(1, 1, figsize=(8, 4))
axes.set_title('Test Set')
sns.heatmap(confusion_matrix(y_test, knn_pred),
annot = True, fmt=".0f", annot_kws={"size": 18})
precision recall f1-score support
0 0.7078 0.7810 0.7426 6880
1 0.7520 0.6732 0.7105 6788
accuracy 0.7275 13668
macro avg 0.7299 0.7271 0.7265 13668
weighted avg 0.7298 0.7275 0.7266 13668
<AxesSubplot:title={'center':'Test Set'}>
# Saving knn model
filename = 'knn.sav'
pickle.dump(knn, open(filename, 'wb'))
The Support Vector Machine algorithm finds a hyperplane in an N-dimensional space (N being the number of features) that distinctly classifies the data points. A good hyperplane has large margins to facilitate easier classification of future points.
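The margin idea can be made concrete with a small numeric sketch (hypothetical 2-D points, not our CVD features): for a hyperplane w·x + b = 0, the distance of a point x to it is |w·x + b| / ‖w‖, and SVC chooses w and b to maximise the smallest such distance over the training points.

```python
import numpy as np

# Distances of three points to the hyperplane x1 + x2 = 3
w, b = np.array([1.0, 1.0]), -3.0
points = np.array([[1.0, 1.0], [0.0, 1.0], [4.0, 2.0]])
margins = np.abs(points @ w + b) / np.linalg.norm(w)
print(np.round(margins, 3))  # the smallest value is the margin SVC tries to maximise
```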

svc = SVC(probability=True)
svc.fit(X_train,y_train)
svc_pred = svc.predict(X_test)
# Plot the Confusion Matrix for Train and Test
print(classification_report(y_test, svc_pred, digits=4))
f, axes = plt.subplots(1, 1, figsize=(8, 4))
axes.set_title('Test Set')
sns.heatmap(confusion_matrix(y_test, svc_pred),
annot = True, fmt=".0f", annot_kws={"size": 18})
precision recall f1-score support
0 0.7101 0.7817 0.7442 6880
1 0.7535 0.6765 0.7129 6788
accuracy 0.7294 13668
macro avg 0.7318 0.7291 0.7285 13668
weighted avg 0.7316 0.7294 0.7286 13668
<AxesSubplot:title={'center':'Test Set'}>
# Saving svc model
filename = 'svc.sav'
pickle.dump(svc, open(filename, 'wb'))
# Since SVC does not expose a feature_importances_ attribute, we use permutation_importance from sklearn instead
# This method shuffles the data in one feature column at a time and tests how much that affects the model accuracy
perm_importance = permutation_importance(svc, X_test, y_test)
feature_names = X_train.columns
features = np.array(feature_names)
sorted_idx = perm_importance.importances_mean.argsort()
plt.barh(features[sorted_idx], perm_importance.importances_mean[sorted_idx])
plt.xlabel("Permutation Importance")
Text(0.5, 0, 'Permutation Importance')
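What permutation_importance computes can be mimicked by hand (a sketch on synthetic data with a stand-in decision rule, not our trained SVC): shuffle one column at a time and record the accuracy drop.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 2))
y_demo = (X_demo[:, 0] > 0).astype(int)      # only feature 0 matters

rule = lambda X: (X[:, 0] > 0).astype(int)   # stand-in "model" that ignores feature 1
base_acc = (rule(X_demo) == y_demo).mean()

drops = []
for j in range(X_demo.shape[1]):
    Xp = X_demo.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])     # break the feature-target link
    drops.append(base_acc - (rule(Xp) == y_demo).mean())
    print(f"feature {j}: accuracy drop = {drops[-1]:.3f}")
```

Shuffling feature 0 costs roughly half the accuracy while shuffling feature 1 costs nothing, which is the same signal the bar chart above encodes for our real features.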
Neural networks are inspired by the biological brain, simulating neurons connected to one another to form a complex network. A deep neural network can be represented as a hierarchical (layered) organisation of neurons, where each neuron passes a signal to neurons in the next layer based on the input it receives, and the network learns through a feedback mechanism.

# Number of features
X_train.shape
(54668, 13)
Dense layers: Change the dimension of the vectors passing through them; every neuron in a dense layer is connected to all neurons of the preceding layer.
Activation layers: Introduce non-linearity into the network so that it can learn the relationship between the input and output values.
Dropout layers: Prevent overfitting by randomly ignoring, or "dropping out", a fraction of the layer outputs during training.
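As an aside, the effect of a Dropout(0.2) layer at training time can be simulated with plain NumPy (a sketch of inverted dropout, which Keras uses: kept activations are rescaled by 1/(1 - rate)):

```python
import numpy as np

rng = np.random.default_rng(0)
activations = np.ones(10)                 # pretend outputs of a layer
keep = rng.random(10) >= 0.2              # each unit dropped with probability 0.2
dropped = activations * keep / (1 - 0.2)  # rescale survivors by 1/(1 - rate)
print(dropped)  # zeros where units were dropped, 1.25 elsewhere
```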
X_train, X_val, y_train, y_val = train_test_split(X_train,y_train,test_size=0.20)
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

nn = Sequential()
nn.add(Dense(13,activation='relu')) # number of neurons = number of features
nn.add(Dense(50,activation='relu',kernel_initializer=tf.random_uniform_initializer(minval=-0.1, maxval=0.1),
bias_initializer=tf.random_uniform_initializer(minval=-0.1, maxval=0.1)))
nn.add(Dropout(0.2))
nn.add(Dense(50,activation='relu',kernel_initializer=tf.random_uniform_initializer(minval=-0.1, maxval=0.1),
bias_initializer=tf.random_uniform_initializer(minval=-0.1, maxval=0.1)))
nn.add(Dropout(0.2))
nn.add(Dense(50,activation='relu',kernel_initializer=tf.random_uniform_initializer(minval=-0.1, maxval=0.1),
bias_initializer=tf.random_uniform_initializer(minval=-0.1, maxval=0.1)))
nn.add(Dropout(0.2))
nn.add(Dense(1,activation='sigmoid'))
nn.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
early_stop = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=5)
nn.fit(x=X_train.values,y=y_train.values,
validation_data=(X_val,y_val.values),
batch_size=100,epochs=150,callbacks=[early_stop])
Epoch 1/150 438/438 [==============================] - 1s 2ms/step - loss: 0.5765 - accuracy: 0.7111 - val_loss: 0.5536 - val_accuracy: 0.7312
Epoch 2/150 438/438 [==============================] - 1s 2ms/step - loss: 0.5575 - accuracy: 0.7262 - val_loss: 0.5503 - val_accuracy: 0.7318
Epoch 3/150 438/438 [==============================] - 1s 2ms/step - loss: 0.5540 - accuracy: 0.7268 - val_loss: 0.5476 - val_accuracy: 0.7316
Epoch 4/150 438/438 [==============================] - 1s 2ms/step - loss: 0.5532 - accuracy: 0.7277 - val_loss: 0.5458 - val_accuracy: 0.7353
Epoch 5/150 438/438 [==============================] - 1s 2ms/step - loss: 0.5527 - accuracy: 0.7277 - val_loss: 0.5462 - val_accuracy: 0.7335
Epoch 6/150 438/438 [==============================] - 1s 2ms/step - loss: 0.5516 - accuracy: 0.7284 - val_loss: 0.5457 - val_accuracy: 0.7342
Epoch 7/150 438/438 [==============================] - 1s 2ms/step - loss: 0.5517 - accuracy: 0.7287 - val_loss: 0.5458 - val_accuracy: 0.7336
Epoch 8/150 438/438 [==============================] - 1s 2ms/step - loss: 0.5506 - accuracy: 0.7295 - val_loss: 0.5471 - val_accuracy: 0.7305
Epoch 9/150 438/438 [==============================] - 1s 2ms/step - loss: 0.5507 - accuracy: 0.7294 - val_loss: 0.5442 - val_accuracy: 0.7337
Epoch 10/150 438/438 [==============================] - 1s 2ms/step - loss: 0.5506 - accuracy: 0.7295 - val_loss: 0.5457 - val_accuracy: 0.7361
Epoch 11/150 438/438 [==============================] - 1s 2ms/step - loss: 0.5492 - accuracy: 0.7300 - val_loss: 0.5442 - val_accuracy: 0.7348
Epoch 12/150 438/438 [==============================] - 1s 2ms/step - loss: 0.5500 - accuracy: 0.7303 - val_loss: 0.5441 - val_accuracy: 0.7353
Epoch 13/150 438/438 [==============================] - 1s 2ms/step - loss: 0.5489 - accuracy: 0.7299 - val_loss: 0.5433 - val_accuracy: 0.7372
Epoch 14/150 438/438 [==============================] - 1s 2ms/step - loss: 0.5493 - accuracy: 0.7305 - val_loss: 0.5457 - val_accuracy: 0.7317
Epoch 15/150 438/438 [==============================] - 1s 2ms/step - loss: 0.5484 - accuracy: 0.7310 - val_loss: 0.5433 - val_accuracy: 0.7350
Epoch 16/150 438/438 [==============================] - 1s 2ms/step - loss: 0.5493 - accuracy: 0.7290 - val_loss: 0.5440 - val_accuracy: 0.7308
Epoch 17/150 438/438 [==============================] - 1s 2ms/step - loss: 0.5487 - accuracy: 0.7313 - val_loss: 0.5449 - val_accuracy: 0.7355
Epoch 18/150 438/438 [==============================] - 1s 2ms/step - loss: 0.5473 - accuracy: 0.7303 - val_loss: 0.5439 - val_accuracy: 0.7357
Epoch 00018: early stopping
<keras.callbacks.History at 0x267183efc70>
# print model structure
nn.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 13) 182
dense_1 (Dense) (None, 50) 700
dropout (Dropout) (None, 50) 0
dense_2 (Dense) (None, 50) 2550
dropout_1 (Dropout) (None, 50) 0
dense_3 (Dense) (None, 50) 2550
dropout_2 (Dropout) (None, 50) 0
dense_4 (Dense) (None, 1) 51
=================================================================
Total params: 6,033
Trainable params: 6,033
Non-trainable params: 0
_________________________________________________________________
losses = pd.DataFrame(nn.history.history)
losses[['loss','val_loss']].plot();
threshold = 0.5
dnn_pred = nn.predict(X_test)
dnn_pred = np.where(dnn_pred > threshold, 1,0)
print(classification_report(y_test,dnn_pred, digits=4))
# Plot the Confusion Matrix for Train and Test
f, axes = plt.subplots(1, 1, figsize=(8, 4))
axes.set_title('Test Set')
sns.heatmap(confusion_matrix(y_test, dnn_pred),
annot = True, fmt=".0f", annot_kws={"size": 18})
precision recall f1-score support
0 0.7304 0.7330 0.7317 6880
1 0.7284 0.7258 0.7271 6788
accuracy 0.7294 13668
macro avg 0.7294 0.7294 0.7294 13668
weighted avg 0.7294 0.7294 0.7294 13668
<AxesSubplot:title={'center':'Test Set'}>
# Saving neural network model
nn.save("nnmodel")
INFO:tensorflow:Assets written to: nnmodel\assets
Steps of Blending
print('size of X_test: ', X_test.shape)
print('size of y_test: ', y_test.shape)
size of X_test: (13668, 13) size of y_test: (13668,)
# use the first 70% of the hold-out set as the blending validation set and the last 30% as the final test set
split = round(X_test.shape[0]*0.7)
split
9568
X_valme = X_test[:split].reset_index().drop('index', axis=1)
X_testme = X_test[split:].reset_index().drop('index', axis=1)
y_valme = y_test[:split].reset_index().drop('index', axis=1)
y_testme = y_test[split:].reset_index().drop('index', axis=1)
val_pred1=XGB_Classifier.predict_proba(X_valme)
test_pred1=XGB_Classifier.predict_proba(X_testme)
val_pred1=pd.DataFrame(pd.DataFrame(val_pred1)[1])
test_pred1=pd.DataFrame(pd.DataFrame(test_pred1)[1])
val_pred2=knn.predict_proba(X_valme)
test_pred2=knn.predict_proba(X_testme)
val_pred2=pd.DataFrame(pd.DataFrame(val_pred2)[1])
val_pred2.rename(columns={1:'2'}, inplace=True)
test_pred2=pd.DataFrame(pd.DataFrame(test_pred2)[1])
test_pred2.rename(columns={1:'2'}, inplace=True)
val_pred3=svc.predict_proba(X_valme)
test_pred3=svc.predict_proba(X_testme)
val_pred3=pd.DataFrame(pd.DataFrame(val_pred3)[1])
val_pred3.rename(columns={1:'3'}, inplace=True)
test_pred3=pd.DataFrame(pd.DataFrame(test_pred3)[1])
test_pred3.rename(columns={1:'3'}, inplace=True)
val_pred4=nn.predict(X_valme)
test_pred4=nn.predict(X_testme)
val_pred4=pd.DataFrame(val_pred4, columns = ['4'])
test_pred4=pd.DataFrame(test_pred4, columns = ['4'])

We use scikit-learn's logistic regression as the meta-model. Since our target is binary, it fits a single binomial model; for a multiclass target it would instead build one model per class in a one-versus-rest scheme and pick the class with the highest predicted probability.
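The blending step itself can be sketched on synthetic data (hypothetical base-model outputs, not the four models above): the meta-model receives each base model's predicted probability as a feature and learns how much to trust each one.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
y_meta = rng.integers(0, 2, 200)                        # true labels
good = np.clip(y_meta + rng.normal(0, 0.2, 200), 0, 1)  # accurate base model's probabilities
noisy = rng.random(200)                                 # uninformative base model
meta_X = np.column_stack([good, noisy])

meta = LogisticRegression().fit(meta_X, y_meta)         # the blending meta-model
print(np.round(meta.coef_[0], 2))  # far larger weight on the accurate model
```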
df_val=pd.concat([X_valme,val_pred1,val_pred2,val_pred3,val_pred4],axis=1)
df_test=pd.concat([X_testme,test_pred1,test_pred2,test_pred3,test_pred4],axis=1)
df_val
| year | gender | height | weight | ap_hi | ap_lo | map | pp | cholesterol | gluc | bmi | active | diabetes | 1 | 2 | 3 | 4 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.681125 | 1 | 1.385465 | 0.401675 | -1.011988 | -1.196711 | -1.196471 | -0.463477 | 1 | 1 | -0.316177 | 1 | 1 | 0.282230 | 0.159091 | 0.237879 | 0.097264 |
| 1 | -0.348191 | 0 | -0.852790 | -0.290187 | 2.029628 | 0.932943 | 1.556653 | 2.112755 | 3 | 1 | 0.151425 | 1 | 1 | 0.717679 | 0.795455 | 0.800479 | 0.876998 |
| 2 | -1.088710 | 1 | 0.727154 | -0.290187 | -0.403665 | -0.664297 | -0.584665 | -0.034105 | 1 | 1 | -0.644195 | 1 | 0 | 0.322638 | 0.181818 | 0.248494 | 0.168152 |
| 3 | 1.577158 | 0 | -0.984453 | 1.554780 | 0.204658 | -1.196711 | -0.584665 | 1.254011 | 1 | 1 | 2.254586 | 0 | 0 | 0.594198 | 0.613636 | 0.794284 | 0.666422 |
| 4 | -1.977332 | 1 | 2.043775 | 0.478549 | -0.403665 | -0.131884 | -0.278763 | -0.463477 | 1 | 1 | -0.546297 | 1 | 0 | 0.277076 | 0.068182 | 0.246439 | 0.137461 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9563 | 0.688536 | 0 | -0.852790 | 0.017307 | -1.011988 | -1.196711 | -1.196471 | -0.463477 | 1 | 1 | 0.485243 | 1 | 1 | 0.345271 | 0.386364 | 0.223267 | 0.287999 |
| 9564 | -0.792502 | 0 | -0.589466 | 1.247286 | 0.812982 | 0.826460 | 0.883667 | 0.481142 | 1 | 1 | 1.633917 | 1 | 0 | 0.697855 | 0.772727 | 0.802976 | 0.817462 |
| 9565 | -0.051983 | 0 | 0.068844 | -0.213314 | -0.403665 | -0.131884 | -0.278763 | -0.463477 | 2 | 2 | -0.257207 | 0 | 1 | 0.419890 | 0.431818 | 0.327862 | 0.428781 |
| 9566 | 0.836640 | 1 | 0.858816 | -0.443935 | 0.204658 | -1.196711 | -0.584665 | 1.254011 | 3 | 3 | -0.845538 | 1 | 0 | 0.583336 | 0.568182 | 0.767781 | 0.703968 |
| 9567 | 0.392328 | 0 | -0.062818 | 1.247286 | 1.421305 | -0.131884 | 0.638945 | 2.112755 | 1 | 1 | 1.280991 | 0 | 1 | 0.723287 | 0.772727 | 0.816515 | 0.824921 |
9568 rows × 17 columns
logreg = LogisticRegression(max_iter=400)
logreg.fit(df_val,np.ravel(y_valme)) # make y_valme into 1-d array
pred = logreg.predict(df_test)
print(classification_report(y_testme, pred, digits=4))
# Plot the Confusion Matrix for Train and Test
f, axes = plt.subplots(1, 1, figsize=(8, 4))
axes.set_title('Test Set')
sns.heatmap(confusion_matrix(y_testme, pred),
annot = True, fmt=".0f", annot_kws={"size": 18})
precision recall f1-score support
0 0.7177 0.7648 0.7405 2054
1 0.7473 0.6979 0.7218 2046
accuracy 0.7315 4100
macro avg 0.7325 0.7314 0.7311 4100
weighted avg 0.7324 0.7315 0.7312 4100
<AxesSubplot:title={'center':'Test Set'}>
# Comparing with our current best model: XGB_Classifier...
temppred = XGB_Classifier.predict(X_testme)
print(classification_report(y_testme, temppred, digits=4))
# Plot the Confusion Matrix for Train and Test
f, axes = plt.subplots(1, 1, figsize=(8, 4))
axes.set_title('Test Set')
sns.heatmap(confusion_matrix(y_testme, temppred),
annot = True, fmt=".0f", annot_kws={"size": 18})
precision recall f1-score support
0 0.7095 0.7824 0.7442 2054
1 0.7564 0.6784 0.7153 2046
accuracy 0.7305 4100
macro avg 0.7329 0.7304 0.7297 4100
weighted avg 0.7329 0.7305 0.7297 4100
<AxesSubplot:title={'center':'Test Set'}>
The ensemble model gave only a small advantage over the individual XGBoost model, but it may improve with base models that use more differentiated classification methods.
# Saving ensemble model
filename = 'logreg.sav'
pickle.dump(logreg, open(filename, 'wb'))
Required inputs from user:
import time

print("Hi! We will let you know your risk for cardiovascular disease according to the following indicators, do follow through and give us a shot!\n")
time.sleep(1)
data = {}
data['year'] = int(input("Current age (year): "))
data['gender'] = int(input("Gender (0 for female and 1 for male): "))
data['height'] = float(input("Height (m): "))
data['weight'] = float(input("Weight (kg): "))
data['ap_hi'] = float(input('Systolic blood pressure (mmHg): '))
data['ap_lo'] = float(input('Diastolic blood pressure (mmHg): '))
data['map'] = float((data['ap_hi']+2*data['ap_lo'])/3)
data['pp'] = float(data['ap_hi'] - data['ap_lo'])
data['cholesterol'] = int(input('Cholesterol level (1 for normal, 2 for above normal, 3 for well above normal): '))
data['gluc'] = int(input('Blood glucose level (1 for normal, 2 for above normal, 3 for well above normal): '))
data['bmi'] = float(data['weight']/data['height']**2)
data['active'] = int(input('Exercise Regularly? (0 for no, 1 for yes): '))
data['diabetes'] = int(input('Diabetes (0 for no, 1 for yes): '))
Hi! We will let you know your risk for cardiovascular disease according to the following indicators, do follow through and give us a shot!

Current age (year): 60
Gender (0 for female and 1 for male): 1
Height (m): 1.8
Weight (kg): 90
Systolic blood pressure (mmHg): 160
Diastolic blood pressure (mmHg): 100
Cholesterol level (1 for normal, 2 for above normal, 3 for well above normal): 3
Blood glucose level (1 for normal, 2 for above normal, 3 for well above normal): 3
Exercise Regularly? (0 for no, 1 for yes): 0
Diabetes (0 for no, 1 for yes): 1
pred_df = pd.DataFrame(data=[[v for k, v in data.items()]], columns=[key for key in data])
pred_df
| year | gender | height | weight | ap_hi | ap_lo | map | pp | cholesterol | gluc | bmi | active | diabetes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60 | 1 | 1.8 | 90.0 | 160.0 | 100.0 | 120.0 | 60.0 | 3 | 3 | 27.777778 | 0 | 1 |
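The derived columns in the table above follow directly from the raw inputs:

```python
# Derived features for the example user (pressures in mmHg, height in m, weight in kg)
ap_hi, ap_lo, height, weight = 160.0, 100.0, 1.8, 90.0
map_ = (ap_hi + 2 * ap_lo) / 3   # mean arterial pressure: (160 + 200) / 3 = 120.0
pp = ap_hi - ap_lo               # pulse pressure: 60.0
bmi = weight / height ** 2       # body mass index: 90 / 3.24 = 27.78
print(map_, pp, round(bmi, 2))
```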
scaler = pickle.load(open('scaler.sav', 'rb'))
to_be_scaled_feat = ['year', 'height', 'weight', 'ap_hi', 'ap_lo', 'map', 'pp', 'bmi']
pred_df[to_be_scaled_feat] = scaler.transform(pred_df[to_be_scaled_feat])
pred_df
| year | gender | height | weight | ap_hi | ap_lo | map | pp | cholesterol | gluc | bmi | active | diabetes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.984743 | 1 | -21.418404 | 1.247286 | 2.029628 | 1.99777 | 2.168458 | 1.254011 | 3 | 3 | 0.096716 | 0 | 1 |
# loading all models
XGB_Classifier = pickle.load(open('xgb.sav', 'rb'))
knn = pickle.load(open('knn.sav', 'rb'))
svc = pickle.load(open('svc.sav', 'rb'))
nn = tf.keras.models.load_model("nnmodel")
logreg = pickle.load(open('logreg.sav', 'rb'))
# making predictions with each model
pred1=XGB_Classifier.predict_proba(pred_df)
pred1=pd.DataFrame(pd.DataFrame(pred1)[1])
pred2=knn.predict_proba(pred_df)
pred2=pd.DataFrame(pd.DataFrame(pred2)[1])
pred2.rename(columns={1:'2'}, inplace=True)
pred3=svc.predict_proba(pred_df)
pred3=pd.DataFrame(pd.DataFrame(pred3)[1])
pred3.rename(columns={1:'3'}, inplace=True)
pred4=nn.predict(pred_df)
pred4=pd.DataFrame(pred4, columns = ['4'])
# concatenating individual model predictions with user data
pred_test=pd.concat([pred_df,pred1,pred2,pred3,pred4],axis=1)
# making ensembled prediction
risk = logreg.predict_proba(pred_test)[0][1]
According to our models, the overall ranking of features from most important to least important is: ap_hi, year, cholesterol, bmi. Since we are not able to advise on our user's age, we will give advice on the remaining variables accordingly. For bmi, since weight is a more controllable factor than height, we will advise the user on weight.


Advice Websites
if risk <= 0.5:
    print('Your risk for cardiovascular disease is', round(risk*100, 2), '%', ', keep it up!')
elif risk < 0.7:
    print('Your risk for cardiovascular disease is', round(risk*100, 2), '%', ', please take action to pull your health back on track!')
else:
    print('Your risk for cardiovascular disease is', round(risk*100, 2), '%', ', please visit the doctor immediately for a thorough diagnosis!')
if data['ap_hi'] > 120:
    print('You are at risk of hypertension, you can visit this recommended website for some advice: https://www.mayoclinic.org/diseases-conditions/high-blood-pressure/in-depth/high-blood-pressure/art-20046974')
if data['cholesterol'] > 1:
    print('You are at risk of high cholesterol, you can visit this recommended website for some advice: https://www.mayoclinic.org/diseases-conditions/high-blood-cholesterol/in-depth/reduce-cholesterol/art-20045935')
if data['bmi'] >= 25:
    print('You are overweight, you can visit this recommended website for some advice: https://reverehealth.com/live-better/tips-to-maintain-a-healthy-weight/')
if data['bmi'] < 18.5:
    print('You are underweight, you can visit this recommended website for some advice: https://rb.gy/hurwan')
Your risk for cardiovascular disease is 91.49 % , please visit the doctor immediately for a thorough diagnosis!
You are at risk of hypertension, you can visit this recommended website for some advice: https://www.mayoclinic.org/diseases-conditions/high-blood-pressure/in-depth/high-blood-pressure/art-20046974
You are at risk of high cholesterol, you can visit this recommended website for some advice: https://www.mayoclinic.org/diseases-conditions/high-blood-cholesterol/in-depth/reduce-cholesterol/art-20045935
You are overweight, you can visit this recommended website for some advice: https://reverehealth.com/live-better/tips-to-maintain-a-healthy-weight/